Shen Mingzhe, Dayhoff Guy W, Shen Jana
Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, MD 21201, U.S.A.
Joint first author.
Res Sq. 2025 Apr 28:rs.3.rs-6471091. doi: 10.21203/rs.3.rs-6471091/v1.
Protein ionization states provide electrostatic forces that modulate protein structure, stability, solubility, and function. Until now, predicting ionization states and understanding protein electrostatics have relied on structural information. Here we demonstrate that primary sequence alone enables remarkably accurate pKa predictions through KaML-ESM, a model that is pretrained on a synthetic pKa dataset and leverages evolutionary representations from large-scale protein language models (ESMs). The KaML-ESM model achieves RMSEs approaching the experimental precision limit of ~0.5 pH units for Asp, Glu, His, and Lys residues, while reducing Cys prediction errors to 1.1 units, with further improvement expected as the training dataset expands. The state-of-the-art performance of KaML-ESM was further validated through external evaluations, including a proteome-wide analysis of protein pKa values. Our results support the notion that protein sequence encodes not only structure and function but also electrostatic properties, which may have been co-optimized through evolution. Lastly, we provide KaML, a sequence-based end-to-end ML platform that enables researchers to map protein electrostatic landscapes, facilitating applications ranging from drug design and protein engineering to molecular simulations.