Shim Heejung, Larget Bret
Department of Statistics, Purdue University, West Lafayette, Indiana, U.S.A.
Departments of Statistics and of Botany, University of Wisconsin, Madison, Wisconsin, U.S.A.
Biometrics. 2018 Mar;74(1):270-279. doi: 10.1111/biom.12640. Epub 2017 Jan 18.
Traditionally, phylogeny and sequence alignment are estimated separately: a multiple sequence alignment is estimated first, and a phylogeny is then inferred from that alignment. This two-step procedure ignores uncertainty in the alignment, which may lead to overstated certainty in the phylogeny estimates. We develop a joint model for co-estimating phylogeny and sequence alignment that improves on the traditional approach by accounting for alignment uncertainty in phylogenetic inference. Our insertion and deletion (indel) model allows arbitrary-length, overlapping indel events and a general distribution for indel fragment size. We employ a Bayesian approach using Markov chain Monte Carlo (MCMC) to estimate the joint posterior distribution of a phylogenetic tree and a multiple sequence alignment. The state space of our Markov chain consists of a tree and a complete history of indel events mapped onto the tree, whereas previous approaches use a tree and an alignment. The larger state space, which contains a complete history of indel events, makes our MCMC approach more challenging, but it enables us to infer more information about the indel process. We compare the performance of this joint method with that of traditional sequential methods using both simulated and real data. Software named BayesCAT (Bayesian Co-estimation of Alignment and Tree) is available at https://github.com/heejungshim/BayesCAT.
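To make the idea of a "complete history of indel events" concrete, the following minimal sketch simulates insertions and deletions along a single branch and records each event. It is an illustration under stated assumptions, not the BayesCAT model or its code: the function names, the per-site rate parameterization, and the geometric fragment-length distribution are all hypothetical choices made for this example. Because each event is placed on the current sequence, later events can overlap regions touched by earlier ones, and the fragment-length sampler can be swapped for any other distribution.

```python
import random


def geometric_length(rng, p=0.5):
    """Draw a fragment length >= 1 from a geometric distribution (assumed choice)."""
    n = 1
    while rng.random() >= p:
        n += 1
    return n


def simulate_indel_history(seq_len, branch_length, ins_rate, del_rate,
                           draw_fragment_length, rng):
    """Simulate indel events along one branch (illustrative only).

    Returns (events, final_len), where events is a time-ordered list of
    (time, kind, position, fragment_length) tuples.
    """
    events = []
    t = 0.0
    length = seq_len
    while True:
        # Hypothetical rates: an insertion can start at any of the
        # length + 1 gaps between residues, a deletion at any residue.
        total_ins = ins_rate * (length + 1)
        total_del = del_rate * length
        total = total_ins + total_del
        if total <= 0:
            break
        # Waiting time to the next event is exponential in the total rate.
        t += rng.expovariate(total)
        if t >= branch_length:
            break
        if rng.random() < total_ins / total:
            pos = rng.randint(0, length)          # gap index, inclusive
            frag = draw_fragment_length(rng)
            length += frag
            events.append((t, "ins", pos, frag))
        else:
            pos = rng.randrange(length)           # first deleted residue
            # Truncate deletions that would run past the sequence end.
            frag = min(draw_fragment_length(rng), length - pos)
            length -= frag
            events.append((t, "del", pos, frag))
    return events, length


if __name__ == "__main__":
    rng = random.Random(1)
    history, final_len = simulate_indel_history(
        seq_len=100, branch_length=1.0, ins_rate=0.01, del_rate=0.01,
        draw_fragment_length=geometric_length, rng=rng)
    for event in history:
        print(event)
    print("final length:", final_len)
```

A sampler over such histories carries more information than one over alignments alone, since many distinct event histories can produce the same alignment; that is the trade-off the abstract describes between a larger state space and richer inference about the indel process.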