Baele Guy, Van de Peer Yves, Vansteelandt Stijn
Department of Applied Mathematics and Computer Science, Ghent University, Ghent, Belgium.
Syst Biol. 2008 Oct;57(5):675-92. doi: 10.1080/10635150802422324.
In this article, we present a likelihood-based framework for modeling site dependencies. Our approach builds upon standard evolutionary models but incorporates site dependencies across the entire tree by letting the evolutionary parameters in these models depend upon the ancestral states at the neighboring sites. It thus avoids the need for introducing new and high-dimensional evolutionary models for site-dependent evolution. We propose a Markov chain Monte Carlo approach with data augmentation to infer the evolutionary parameters under our model. Although our approach allows for wide-ranging site dependencies, we illustrate its use, in two non-coding datasets, in the case of nearest-neighbor dependencies (i.e., evolution directly depending only upon the immediate flanking sites). The results reveal that the general time-reversible model with nearest-neighbor dependencies substantially improves the fit to the data as compared to the corresponding model with site independence. Using the parameter estimates from our model, we elaborate on the importance of the 5-methylcytosine deamination process (i.e., the CpG effect) and show that this process also depends upon the 5' neighboring base identity. We hint at the possibility of a so-called TpA effect and show that the observed substitution behavior is very complex in the light of dinucleotide estimates. We also discuss the presence of CpG effects in a nuclear small subunit dataset and find significant evidence that evolutionary models incorporating context-dependent effects perform substantially better than independent-site models and in some cases even outperform models that incorporate varying rates across sites.
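As a rough illustration of the nearest-neighbor idea described above, the following sketch keeps one standard GTR rate matrix per pair of flanking bases (16 contexts) and picks the matrix for a site from the ancestral states at its immediate neighbors. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`gtr_rate_matrix`, `context_models`, `site_rate_matrix`), the `ct_boost` parameter, and all numeric values are hypothetical, and the way the C/T exchangeability is raised when the 3' neighbor is G is only a toy stand-in for a CpG effect.

```python
from itertools import product

BASES = "ACGT"


def gtr_rate_matrix(exch, freqs):
    """Build an unnormalized GTR rate matrix as a 4x4 nested list.

    exch  : dict mapping unordered base pairs ("AC", "AG", ...) to
            exchangeability parameters
    freqs : dict mapping each base to its stationary frequency
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                key = "".join(sorted(x + y))
                q[i][j] = exch[key] * freqs[y]
        # Diagonal entry makes the row sum to zero.
        q[i][i] = -sum(q[i])
    return q


def context_models(base_exch, freqs, ct_boost=3.0):
    """Return one GTR matrix per (left, right) neighbor context.

    ASSUMPTION: the only context effect modeled here is a higher C/T
    exchangeability when the right (3') neighbor is G. The value of
    ct_boost is invented for illustration.
    """
    models = {}
    for left, right in product(BASES, repeat=2):
        exch = dict(base_exch)
        if right == "G":
            exch["CT"] *= ct_boost
        models[(left, right)] = gtr_rate_matrix(exch, freqs)
    return models


def site_rate_matrix(models, ancestral_seq, pos):
    """Select the rate matrix for an interior site from its flanking bases."""
    if pos <= 0 or pos >= len(ancestral_seq) - 1:
        raise ValueError("this sketch only handles interior sites")
    return models[(ancestral_seq[pos - 1], ancestral_seq[pos + 1])]


# Hypothetical parameter values: equal exchangeabilities and frequencies.
base_exch = {"AC": 1.0, "AG": 1.0, "AT": 1.0, "CG": 1.0, "CT": 1.0, "GT": 1.0}
freqs = {b: 0.25 for b in BASES}
models = context_models(base_exch, freqs)

c, t = BASES.index("C"), BASES.index("T")
q_cpg = site_rate_matrix(models, "ACGT", 1)    # C followed by G
q_other = site_rate_matrix(models, "ACAT", 1)  # C followed by A
print(q_cpg[c][t], q_other[c][t])
```

The sketch takes the ancestral sequence as given. In the framework described above, ancestral states are not observed, which is why inference relies on Markov chain Monte Carlo with data augmentation; that step is not shown here.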