Huang Yao-Ming, Bystroff Christopher
University of California, San Francisco, San Francisco.
Rensselaer Polytechnic Institute, Troy.
IEEE/ACM Trans Comput Biol Bioinform. 2013 Sep-Oct;10(5):1176-87. doi: 10.1109/TCBB.2013.113.
Nature possesses a secret formula for the energy as a function of the structure of a protein. In protein design, approximations are made to both the structural representation of the molecule and to the form of the energy equation, such that the existence of a general energy function for proteins is by no means guaranteed. Here, we present new insights toward the application of machine learning to the problem of finding a general energy function for protein design. Machine learning requires the definition of an objective function, which carries with it the implied definition of success in protein design. We explored four functions, consisting of two functional forms, each with two criteria for success. Optimization was carried out by a Monte Carlo search through the space of all variable parameters. Cross-validation of the optimized energy function against a test set gave significantly different results depending on the choice of objective function, pointing to relative correctness of the built-in assumptions. Novel energy cross terms correct for the observed nonadditivity of energy terms and an imbalance in the distribution of predicted amino acids. This paper expands on the work presented at the 2012 ACM-BCB.
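The optimization described in the abstract (a Monte Carlo search over the variable parameters of an energy function, scored by an objective function that encodes a definition of design success) can be illustrated with a minimal sketch. This is not the authors' implementation: the linear energy form, the "native has the lowest energy" success criterion, the Metropolis-style acceptance rule, the synthetic data, and every name below (`energy`, `objective`, `make_toy_cases`, `monte_carlo_search`) are assumptions chosen only for illustration.

```python
import math
import random


def energy(weights, terms):
    """Assumed linear energy: weighted sum of per-candidate energy terms."""
    return sum(w * t for w, t in zip(weights, terms))


def objective(weights, cases):
    """Fraction of cases in which the native candidate has the lowest energy.

    Each case is (native_terms, list_of_decoy_terms). This success criterion
    is a hypothetical stand-in, not one of the paper's four functions.
    """
    hits = 0
    for native_terms, decoys in cases:
        e_native = energy(weights, native_terms)
        if all(e_native < energy(weights, d) for d in decoys):
            hits += 1
    return hits / len(cases)


def make_toy_cases(n_cases=50, n_params=3, n_candidates=5, seed=0):
    """Build synthetic cases using a hidden 'true' weight vector (toy data)."""
    rng = random.Random(seed)
    true_weights = [rng.uniform(-2.0, 2.0) for _ in range(n_params)]
    cases = []
    for _ in range(n_cases):
        candidates = [[rng.gauss(0.0, 1.0) for _ in range(n_params)]
                      for _ in range(n_candidates)]
        # Label the candidate with the lowest true energy as the native.
        candidates.sort(key=lambda terms: energy(true_weights, terms))
        cases.append((candidates[0], candidates[1:]))
    return cases


def monte_carlo_search(cases, n_params, steps=2000, step_size=0.2,
                       temperature=0.05, seed=0):
    """Metropolis-style random walk over weights that maximizes the objective."""
    rng = random.Random(seed)
    weights = [1.0] * n_params
    score = objective(weights, cases)
    best_weights, best_score = list(weights), score
    for _ in range(steps):
        # Perturb one randomly chosen parameter.
        trial = list(weights)
        trial[rng.randrange(n_params)] += rng.gauss(0.0, step_size)
        trial_score = objective(trial, cases)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        delta = trial_score - score
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            weights, score = trial, trial_score
            if score > best_score:
                best_weights, best_score = list(weights), score
    return best_weights, best_score


if __name__ == "__main__":
    training_cases = make_toy_cases(seed=1)
    fitted_weights, fitted_score = monte_carlo_search(training_cases, n_params=3)
    print("fitted weights:", fitted_weights)
    print("training objective:", fitted_score)
```

In this sketch, swapping in a different `objective` changes what counts as success without touching the search loop, which is the kind of comparison the abstract describes. Scoring the fitted weights on held-out cases, for example a second call to `make_toy_cases` with another seed, would play the role of the cross-validation step.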