Schmidler Scott C, Lucas Joseph E, Oas Terrence G
Institute of Statistics and Decision Sciences, Duke University, Durham, NC 27708, USA.
J Comput Biol. 2007 Dec;14(10):1287-310. doi: 10.1089/cmb.2007.0008.
Analysis of biopolymer sequences and structures generally adopts one of two approaches: use of detailed biophysical theoretical models of the system with experimentally determined parameters, or largely empirical statistical models obtained by extracting parameters from large datasets. In this work, we demonstrate a merger of these two approaches using Bayesian statistics. We adopt a common biophysical model for local protein folding and peptide configuration, the helix-coil model. The parameters of this model are estimated by statistical fitting to a large dataset, using prior distributions based on experimental data. L1-norm shrinkage priors are applied to induce sparsity among the estimated parameters, resulting in a significantly simplified model. Formal statistical procedures are presented for evaluating support in the data for previously proposed model extensions. We demonstrate the advantages of this approach, including improved prediction accuracy and quantification of prediction uncertainty, and discuss opportunities for statistical design of experiments. Our approach yields a 39% improvement in mean-squared predictive error over the current best algorithm for this problem. In the process, we also provide an efficient recursive algorithm for exact calculation of ensemble helicity including sidechain interactions, and derive an explicit relation between homo- and heteropolymer helix-coil theories and Markov chains and (non-standard) hidden Markov models, respectively, which has not appeared in the literature previously.
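To make the link between helix-coil theory and Markov-chain recursions concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the simplest two-state (Zimm-Bragg style) homopolymer weighting, with a propagation weight `s` and a nucleation weight `sigma`, and ignores sidechain interactions and any minimum helix length. The function names and the weighting scheme are illustrative assumptions. A forward-backward recursion over the two states gives the exact per-residue helix probability in time linear in chain length.

```python
def helix_probabilities(n, s, sigma):
    """Exact per-residue helix probabilities for a two-state homopolymer.

    Assumed weights (illustrative, not taken from the paper):
      coil residue                       -> 1
      helix residue following helix      -> s
      helix residue following coil       -> sigma * s
    Returns (list of P(residue i is helical), partition function Z).
    """
    # w[prev][cur], with state 0 = coil and state 1 = helix
    w = [[1.0, sigma * s],
         [1.0, s]]

    # fwd[i][k]: total weight of residues 0..i with residue i in state k.
    # The chain is treated as preceded by a virtual coil residue.
    fwd = [[0.0, 0.0] for _ in range(n)]
    fwd[0] = [w[0][0], w[0][1]]
    for i in range(1, n):
        for k in (0, 1):
            fwd[i][k] = fwd[i - 1][0] * w[0][k] + fwd[i - 1][1] * w[1][k]

    # bwd[i][k]: total weight of residues i+1..n-1 given residue i in state k.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for k in (0, 1):
            bwd[i][k] = w[k][0] * bwd[i + 1][0] + w[k][1] * bwd[i + 1][1]

    z = fwd[n - 1][0] + fwd[n - 1][1]
    probs = [fwd[i][1] * bwd[i][1] / z for i in range(n)]
    return probs, z


def mean_helicity(n, s, sigma):
    """Ensemble-average fraction of helical residues."""
    probs, _ = helix_probabilities(n, s, sigma)
    return sum(probs) / n


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    print(mean_helicity(20, 1.5, 0.01))
```

With `sigma = 1` the residues become independent, so each helix probability reduces to `s / (1 + s)`, which is a convenient sanity check. A heteropolymer version would make `s` depend on residue position, which is where the hidden-Markov-model analogy mentioned in the abstract comes in.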