Lennox Kristin P, Dahl David B, Vannucci Marina, Day Ryan, Tsai Jerry W
Department of Statistics, Texas A&M University, 3143 TAMU, College Station, Texas 77843-3143, USA
Ann Appl Stat. 2010 Jun 1;4(2):916-942. doi: 10.1214/09-AOAS296.
By providing new insights into the distribution of a protein's torsion angles, recent statistical models for these data have pointed the way to more efficient methods for protein structure prediction. Most current approaches have concentrated on bivariate models at a single sequence position. There is, however, considerable value in simultaneously modeling angle pairs at multiple sequence positions in a protein. One area of application for such models is structure prediction for the highly variable loop and turn regions. Such modeling is difficult because the number of known protein structures available to estimate these torsion angle distributions is typically small. Furthermore, the data are "sparse" in that not all proteins have angle pairs at each sequence position. We propose a new semiparametric model for the joint distributions of angle pairs at multiple sequence positions. Our model accommodates sparse data by leveraging known information about the behavior of protein secondary structure. We demonstrate our technique by predicting the torsion angles in a loop from the globin fold family. Our results show that a template-based approach can now be successfully extended to modeling the notoriously difficult loop and turn regions.