Protein Electrostatic Properties are Fine-Tuned Through Evolution.

Author information

Shen Mingzhe, Dayhoff Guy W, Shen Jana

Affiliations

Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, MD 21201, U.S.A.

Joint first author.

Publication information

Res Sq. 2025 Apr 28:rs.3.rs-6471091. doi: 10.21203/rs.3.rs-6471091/v1.

Abstract

Protein ionization states provide electrostatic forces that modulate protein structure, stability, solubility, and function. Until now, predicting ionization states and understanding protein electrostatics have relied on structural information. Here we demonstrate that primary sequence alone enables remarkably accurate pKa predictions through KaML-ESM, a model pretrained on a synthetic pKa dataset that leverages evolutionary representations from large-scale protein language models (ESMs). The KaML-ESM model achieves RMSEs approaching the experimental precision limit of ~0.5 pH units for Asp, Glu, His, and Lys residues, while reducing Cys prediction errors to 1.1 units, with further improvement expected as the training dataset expands. The state-of-the-art performance of KaML-ESM was further validated through external evaluations, including a proteome-wide analysis of protein pKa values. Our results support the notion that protein sequence encodes not only structure and function but also electrostatic properties, which may have been co-optimized through evolution. Lastly, we provide KaML, a sequence-based end-to-end ML platform that enables researchers to map protein electrostatic landscapes, facilitating applications ranging from drug design and protein engineering to molecular simulations.
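To make the link between a predicted pKa and an ionization state concrete, the sketch below applies the standard Henderson-Hasselbalch relation for a single, non-interacting titratable site. It is an illustration only and is not part of KaML or KaML-ESM; the function names and the example pKa values are hypothetical, chosen for demonstration.

```python
def protonated_fraction(pka: float, ph: float) -> float:
    """Fraction of a single titratable site that is protonated at a given pH.

    Henderson-Hasselbalch for one non-interacting site:
        f = 1 / (1 + 10**(pH - pKa))
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def mean_charge(pka: float, ph: float, acidic: bool) -> float:
    """Average charge of the site at the given pH.

    Acidic sites (e.g. Asp, Glu, Cys) are neutral when protonated and -1
    when deprotonated; basic sites (e.g. His, Lys) are +1 when protonated
    and neutral when deprotonated.
    """
    f = protonated_fraction(pka, ph)
    return f - 1.0 if acidic else f


if __name__ == "__main__":
    # Hypothetical pKa values for illustration, not results from the paper.
    examples = [("Asp", 4.0, True), ("His", 6.5, False), ("Lys", 10.4, False)]
    for name, pka, acidic in examples:
        q = mean_charge(pka, ph=7.4, acidic=acidic)
        print(f"{name}: pKa={pka:.1f}, mean charge at pH 7.4 = {q:+.3f}")
```

Under this relation, an error of one pH unit in pKa shifts the protonated-to-deprotonated ratio by a factor of ten, which is why prediction errors are reported in pH units.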

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aba3/12060968/dfbf81db71a3/nihpp-rs6471091v1-f0001.jpg
