Larson Gary, Thorne Jeffrey L, Schmidler Scott
Department of Statistical Science, Duke University, Durham, North Carolina.
Department of Biological Sciences, North Carolina State University, Raleigh, North Carolina.
J Comput Biol. 2020 Mar;27(3):361-375. doi: 10.1089/cmb.2019.0500. Epub 2020 Feb 13.
Evolutionary models of proteins are widely used for statistical sequence alignment and inference of homology and phylogeny. However, the vast majority of these models rely on an unrealistic assumption of independent evolution between sites. Here we focus on the related problem of protein structure alignment, a classic tool of computational biology that is widely used to identify structural and functional similarity and to infer homology among proteins. A site-independent statistical model for protein structural evolution has previously been introduced and shown to significantly improve alignments and phylogenetic inferences compared with approaches that utilize only amino acid sequence information. Here we extend this model to account for correlated evolutionary drift among neighboring amino acid positions. The result is a spatiotemporal model of protein structure evolution, described by a multivariate diffusion process convolved with a spatial birth-death process. This extended site-dependent model (SDM) comes with little additional computational cost or analytical complexity compared with the site-independent model (SIM). We demonstrate that this SDM yields a significant reduction of bias in estimated evolutionary distances and helps further improve phylogenetic tree reconstruction. We also develop a simple model of site-dependent sequence evolution, which we use to demonstrate the bias resulting from the application of standard site-independent sequence evolution models.
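The idea of correlated drift among neighboring positions can be illustrated with a minimal sketch. The abstract does not give the model's equations, so everything below is an assumption made for illustration: a one-dimensional Ornstein-Uhlenbeck diffusion per site, noise whose correlation between sites i and j decays as rho**|i-j|, and the hypothetical names `ar1_noise`, `ou_transition`, `theta`, `sigma`, and `rho`. The birth-death (insertion/deletion) component is omitted. Setting `rho = 0` recovers independent sites, which is the site-independent case in this toy setting.

```python
import math
import random


def ar1_noise(n, rho, rng):
    """Draw n unit-variance Gaussians with corr(z_i, z_j) = rho**|i-j| (n >= 1)."""
    z = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - rho * rho)
    for _ in range(n - 1):
        # AR(1) recursion keeps unit variance and geometric correlation decay.
        z.append(rho * z[-1] + scale * rng.gauss(0.0, 1.0))
    return z


def ou_transition(x0, t, theta, sigma, rho, rng):
    """Sample site coordinates after time t from an assumed OU process.

    Each site reverts toward 0 at rate theta; the noise driving neighboring
    sites is correlated through rho (an illustrative assumption).
    """
    decay = math.exp(-theta * t)
    # Exact OU transition standard deviation over an interval of length t.
    spread = math.sqrt(sigma ** 2 / (2.0 * theta) * (1.0 - decay ** 2))
    z = ar1_noise(len(x0), rho, rng)
    return [decay * x + spread * e for x, e in zip(x0, z)]


if __name__ == "__main__":
    rng = random.Random(1)
    ancestor = [0.0] * 10
    # Same ancestor, with and without correlation between neighboring sites.
    print(ou_transition(ancestor, 0.5, 1.0, 1.0, 0.0, rng))
    print(ou_transition(ancestor, 0.5, 1.0, 1.0, 0.9, rng))
```

With a large `rho`, adjacent sites tend to move together, so treating them as independent observations would overstate the amount of evidence about the elapsed time `t`; this is one informal way to see why ignoring site dependence can bias distance estimates.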