Department of Chemistry, Michigan State University, 578 S. Shaw Lane, East Lansing, Michigan 48824, United States.
Institute for Cyber Enabled Research, Michigan State University, 567 Wilson Road, East Lansing, Michigan 48824, United States.
J Chem Inf Model. 2019 May 28;59(5):1919-1929. doi: 10.1021/acs.jcim.8b00734. Epub 2019 Feb 20.
Knowledge-based potentials generally outperform physics-based scoring functions in detecting the native structure among a collection of decoy protein structures. By using a reference state, the pure interactions between atom/residue pairs can be obtained by removing the contributions of ideal-gas state potentials. However, conventional knowledge-based potentials struggle to assign different importance factors to different atom/residue pairs. In this work, using the "comparison" concept, Random Forest (RF) models were successfully generated from unbalanced data sets; these models assign different importance factors to atom pair potentials to enhance their ability to distinguish native proteins from decoy proteins. Individual and combined data sets consisting of 12 decoy sets were used to test the performance of the RF models. We find that RF models improve the recognition of native structures without affecting their ability to identify the best decoy structures. We also created models using scrambled atom types, which produce physically unrealistic probability functions, to test whether the RF algorithm can build useful models from scrambled input probability functions. In this test, we were unable to create models of similar quality to those built from the unscrambled probability functions. Next, we created uniform probability functions in which the peak positions are the same as in the original functions but every interaction has the same peak height. Using these uniform potentials, we recovered models as good as those using the full potentials, suggesting that the experimental peak positions are all that matters in these models. The KECSA2 potential and all code used in this work are available at https://github.com/JunPei000/protein_folding-decoy-set.
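The "comparison" concept mentioned in the abstract can be illustrated with a minimal sketch. This is one possible reading and is not taken from the authors' repository: it assumes each structure is summarized by a list of per-atom-pair-type scores, and that a training row is the difference between two such lists, labeled by whether the first structure is the native one. The function name `comparison_rows`, the descriptor layout, and the toy numbers are all hypothetical.

```python
from itertools import permutations


def comparison_rows(structures):
    """Build ordered-pair training rows from per-structure descriptors.

    `structures` is a list of (descriptor, is_native) tuples, where each
    descriptor is a list of per-atom-pair-type scores (hypothetical layout,
    assumed for illustration only).
    """
    rows = []
    for (desc_a, native_a), (desc_b, native_b) in permutations(structures, 2):
        if native_a == native_b:
            continue  # only native-vs-decoy comparisons receive a label
        # Feature vector: element-wise difference of the two descriptors.
        diff = [a - b for a, b in zip(desc_a, desc_b)]
        # Label 1 means the first structure of the pair is the native one.
        label = 1 if native_a else 0
        rows.append((diff, label))
    return rows


# Toy data (invented): one native and two decoys, three atom-pair types each.
toy = [
    ([-2.0, -1.5, -0.5], True),
    ([-1.0, -1.5, 0.0], False),
    ([-0.5, -0.2, -0.5], False),
]
rows = comparison_rows(toy)
for diff, label in rows:
    print(label, diff)
```

Rows of this shape could then be passed to a random forest classifier, whose per-feature importances would play the role of the importance factors for atom pair potentials described above. How the authors actually construct and balance their comparison pairs should be checked against the linked repository.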