Assessing generative model coverage of protein structures with SHAPES.

Author information

Lu Tianyu, Liu Melissa, Chen Yilin, Kim Jinho, Huang Po-Ssu

Affiliations

Department of Bioengineering, Stanford University, Stanford, CA, USA.

Department of Physics, Stanford University, Stanford, CA, USA.

Publication information

Cell Syst. 2025 Jul 23:101347. doi: 10.1016/j.cels.2025.101347.

DOI:10.1016/j.cels.2025.101347

PMID:40738113

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12321228/

Abstract

Recent advances in generative modeling enable efficient sampling of protein structures, but their tendency to optimize for designability imposes a bias toward idealized structures at the expense of loops and other complex structural motifs that are critical for function. We introduce SHAPES (structural and hierarchical assessment of proteins with embedding similarity) to evaluate five state-of-the-art generative models of protein structures. Using structural embeddings across multiple structural hierarchies, ranging from local geometries to global protein architectures, we reveal substantial undersampling of the observed protein structure space by these models. We use Fréchet protein distance (FPD) to quantify distributional coverage. Different models are distinct in their coverage behavior across different sampling noise scales and temperatures. The frequency of tertiary motifs (TERMs) further supports the observations. More robust sequence design and structure prediction methods are likely crucial in guiding the development of models with improved coverage of the designable protein space. A record of this paper's transparent peer review process is included in the supplemental information.
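The abstract does not give the formula for FPD. As a rough illustration only, the sketch below assumes FPD follows the same construction as the Fréchet inception distance: fit a Gaussian to each set of structural embeddings and compute the Fréchet distance between the two Gaussians, $\lVert \mu_x - \mu_y \rVert^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_y - 2(\Sigma_x \Sigma_y)^{1/2}\right)$. The function name `frechet_distance` and the random stand-in embeddings are hypothetical and are not taken from the SHAPES code.

```python
import numpy as np


def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    x, y: arrays of shape (n_samples, n_features) with n_features >= 2.
    Assumption: this mirrors the FID-style formula; the paper's exact
    FPD definition may differ.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)

    # Mean and covariance of each embedding set (rows are samples).
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.atleast_2d(np.cov(x, rowvar=False))
    cov_y = np.atleast_2d(np.cov(y, rowvar=False))

    # Tr((cov_x cov_y)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_x @ cov_y, which are real and non-negative for
    # positive semi-definite inputs. Clip tiny negative values that come
    # from floating-point error.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)


# Stand-in data: random vectors in place of real structural embeddings.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))
generated = reference + 2.0  # same spread, shifted mean

print(frechet_distance(reference, reference))
print(frechet_distance(reference, generated))
```

Under this assumed definition, a lower value means the generated embeddings are distributed more like the reference embeddings. In practice the inputs would be embeddings of reference and sampled structures at a chosen structural hierarchy, not random vectors.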
