Nordquist Erik, Zhang Guohui, Barethiya Shrishti, Ji Nathan, White Kelli M, Han Lu, Jia Zhiguang, Shi Jingyi, Cui Jianmin, Chen Jianhan
Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA.
Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA.
bioRxiv. 2023 Jun 26:2023.06.24.546384. doi: 10.1101/2023.06.24.546384.
Machine learning has played transformative roles in numerous chemical and biophysical problems, such as protein folding, where large amounts of data exist. Nonetheless, many important problems remain challenging for data-driven machine learning approaches because data are scarce. One way to overcome data scarcity is to incorporate physical principles, for example through molecular modeling and simulation. Here, we focus on the big potassium (BK) channels, which play important roles in the cardiovascular and neural systems. Many BK channel mutants are associated with neurological and cardiovascular diseases, but their molecular effects are unknown. The voltage gating properties of BK channels have been characterized experimentally for 473 site-specific mutations over the last three decades; yet these functional data by themselves remain far too sparse to derive a predictive model of BK channel voltage gating. Using physics-based modeling, we quantify the energetic effects of all single mutations on both the open and closed states of the channel. Together with dynamic properties derived from atomistic simulations, these physical descriptors allow the training of random forest models that reproduce unseen experimentally measured shifts in gating voltage, ΔV, with an RMSE of ∼32 mV and a correlation coefficient of R ∼ 0.7. Importantly, the model appears capable of uncovering nontrivial physical principles underlying the gating of the channel, including a central role of hydrophobic gating. The model was further evaluated using four novel mutations of L235 and V236 on the S5 helix. Mutations at these two sites are predicted to have opposing effects on V, which suggests a key role of S5 in mediating voltage sensor-pore coupling. The measured ΔV values agree quantitatively with the predictions for all four mutations, with a high correlation of R = 0.92 and an RMSE of 18 mV. The model can therefore capture nontrivial voltage gating properties in regions where few mutations are known. The success of predictive modeling of BK voltage gating demonstrates the potential of combining physics and statistical learning to overcome data scarcity in nontrivial protein function prediction.
Deep machine learning has brought many exciting breakthroughs in chemistry, physics and biology. These models require large amounts of training data and struggle when data are scarce. This is the case for predictive modeling of the function of complex proteins such as ion channels, where only a few hundred mutational data points may be available. Using the big potassium (BK) channel as a biologically important model system, we demonstrate that a reliable predictive model of its voltage gating property can be derived from only 473 mutational data points by incorporating physics-derived features, which include dynamic properties from molecular dynamics simulations and energetic quantities from Rosetta mutation calculations. We show that the final random forest model captures key trends and hotspots in the mutational effects on BK voltage gating, such as the important role of pore hydrophobicity. A particularly curious prediction is that mutations of two adjacent residues on the S5 helix would always have opposite effects on the gating voltage, which was confirmed by experimental characterization of four novel mutations. The current work demonstrates the importance and effectiveness of incorporating physics in predictive modeling of protein function with scarce data.
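The abstract reports model quality as an RMSE and a correlation coefficient R between predicted and measured shifts in gating voltage. As a minimal sketch of how these two metrics are commonly computed, assuming R denotes the Pearson correlation coefficient (the text does not specify which one), the following Python uses only the standard library. The function names and the four example values are hypothetical and are not data from this study.

```python
import math


def rmse(predicted, measured):
    """Root-mean-square error between two equal-length sequences (e.g. in mV)."""
    if len(predicted) != len(measured) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pearson_r(predicted, measured):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    covariance = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_p * var_m)


# Hypothetical predicted and measured voltage shifts (mV), for illustration only.
predicted_shift_mv = [30.0, -20.0, 45.0, -35.0]
measured_shift_mv = [40.0, -10.0, 30.0, -50.0]

print(f"RMSE = {rmse(predicted_shift_mv, measured_shift_mv):.1f} mV")
print(f"R    = {pearson_r(predicted_shift_mv, measured_shift_mv):.2f}")
```

The two metrics are complementary: RMSE measures the typical size of the prediction error in millivolts, while R measures how well the predictions track the ordering and relative magnitude of the measured shifts, independent of scale.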