Webb Alex, Hancock John M, Holmes Chris C
Department of Statistics, Oxford, UK.
Bioinformatics. 2009 Jan 15;25(2):197-203. doi: 10.1093/bioinformatics/btn607. Epub 2008 Nov 20.
Conventional phylogenetic analysis for characterizing the relatedness between taxa typically assumes that a single relationship exists between species at every site along the genome. This assumption ignores recombination, which is a fundamental process for generating diversity, and can therefore lead to spurious results. Recombination induces a localized phylogenetic structure that may vary along the genome. Here, we generalize a hidden Markov model (HMM) to infer changes in phylogeny along multiple sequence alignments while accounting for rate heterogeneity; the hidden states represent the unobserved phylogenetic topology underlying the relatedness at a genomic location. The number of hidden states (topologies) and their structure are random (not known a priori) and are sampled using Markov chain Monte Carlo algorithms. The HMM structure allows us to analytically integrate over all possible changepoints in topology as well as all the unknown branch lengths.
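As a rough illustration of how an HMM sums over every possible placement of topology changepoints, the sketch below runs the standard forward recursion in log space and checks it against explicit enumeration of all hidden paths. It is not the authors' implementation: the two-topology setup, the "sticky" transition matrix, and the per-site log-likelihood values are invented for the example, and the branch-length integration and MCMC sampling over the number of topologies described above are omitted.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(log_init, log_trans, log_emit):
    """Log-likelihood of the alignment, summed over all hidden topology paths.

    log_init[k]     : log prior probability that the first site has topology k
    log_trans[j][k] : log probability of moving from topology j to topology k
    log_emit[t][k]  : log-likelihood of alignment column t under topology k
                      (assumed to be supplied by some phylogenetic model)
    """
    n_states = len(log_init)
    alpha = [log_init[k] + log_emit[0][k] for k in range(n_states)]
    for t in range(1, len(log_emit)):
        alpha = [
            logsumexp([alpha[j] + log_trans[j][k] for j in range(n_states)])
            + log_emit[t][k]
            for k in range(n_states)
        ]
    return logsumexp(alpha)


def brute_force_log_likelihood(log_init, log_trans, log_emit):
    """Same quantity by enumerating every path; only feasible for tiny inputs."""
    n_states, n_sites = len(log_init), len(log_emit)
    path_scores = []
    for path in product(range(n_states), repeat=n_sites):
        score = log_init[path[0]] + log_emit[0][path[0]]
        for t in range(1, n_sites):
            score += log_trans[path[t - 1]][path[t]] + log_emit[t][path[t]]
        path_scores.append(score)
    return logsumexp(path_scores)


# Hypothetical inputs: 2 candidate topologies, 5 alignment columns.
log_init = [math.log(0.5), math.log(0.5)]
# "Sticky" transitions: a change of topology (a changepoint) is rare.
log_trans = [
    [math.log(0.95), math.log(0.05)],
    [math.log(0.05), math.log(0.95)],
]
# Made-up per-site likelihoods; the last two columns favour topology 1.
log_emit = [
    [math.log(0.020), math.log(0.004)],
    [math.log(0.015), math.log(0.003)],
    [math.log(0.018), math.log(0.006)],
    [math.log(0.002), math.log(0.017)],
    [math.log(0.003), math.log(0.021)],
]

fast = forward_log_likelihood(log_init, log_trans, log_emit)
slow = brute_force_log_likelihood(log_init, log_trans, log_emit)
print(fast, slow)
```

The forward recursion costs time proportional to the number of sites times the square of the number of topologies, whereas the enumeration grows exponentially with the number of sites, which is why the HMM formulation makes marginalizing over changepoints tractable.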
We demonstrate our approach on simulated data, on the genome of a suspected HIV recombinant strain, and in an investigation of recombination in the sequences of 15 laboratory mouse strains sequenced by Perlegen Sciences. Our findings indicate that the method can distinguish between rate heterogeneity and variation in phylogeny caused by recombination without being restricted to 4-taxa data.