Department of Cell Research and Immunology, Tel Aviv University, Tel Aviv, Israel.
Genome Biol Evol. 2011;3:1265-75. doi: 10.1093/gbe/evr101. Epub 2011 Oct 4.
Bacterial evolution is characterized by frequent gain and loss events of gene families. These events can be inferred from phyletic pattern data, a compact representation of gene family repertoire across multiple genomes. The maximum parsimony paradigm is a classical and prevalent approach for the detection of gene family gains and losses mapped on specific branches. We and others have previously developed probabilistic models that aim to account for the stochastic dynamics of gain and loss. These models are a critical component of a methodology termed stochastic mapping, in which probabilities and expectations of gain and loss events are estimated for each branch of an underlying phylogenetic tree. In this work, we present a phyletic pattern simulator in which the gain and loss dynamics are assumed to follow a continuous-time Markov chain along the tree. Various models and options are implemented to make the simulation software useful for a large number of studies in which binary (presence/absence) data are analyzed. Using this simulation software, we compared the ability of the maximum parsimony and the stochastic mapping approaches to accurately detect gain and loss events along the tree. Our simulations cover a large array of evolutionary scenarios in terms of the propensities for gene family gains and losses and the variability of these propensities among gene families. Although both methods achieve relatively low false positive rates in all simulation schemes, stochastic mapping outperforms maximum parsimony in terms of true positive rates. We further studied the factors that influence the performance of both methods. We find, for example, that the accuracy of maximum parsimony inference is substantially reduced when the goal is to map gain and loss events along internal branches of the phylogenetic tree. Furthermore, the accuracy of stochastic mapping is reduced with smaller data sets (limited number of gene families) due to unreliable estimation of branch lengths. Our simulator and simulation results are additionally relevant for the analysis of other types of binary-coded data, such as the existence of homologous restriction sites, gaps, and introns, to name a few. Both the simulation software and the inference methodology are freely available at a user-friendly server: http://gloome.tau.ac.il/.
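To make the simulation idea concrete, the following is a minimal sketch of how a single gene family's presence/absence state could be evolved along a tree under a two-state continuous-time Markov chain, while recording the true number of gains and losses on each branch. It is not the authors' software: the function names, the four-taxon tree, the rate values, and the choice to draw the root state from the stationary distribution are all illustrative assumptions, and rate variability among gene families is not modeled.

```python
import random

def simulate_branch(state, length, gain, loss, rng):
    """Evolve a 0/1 state along one branch of the given length.

    Waiting times are exponential with rate `gain` in state 0 and
    `loss` in state 1. Returns (end_state, n_gains, n_losses).
    """
    t, gains, losses = 0.0, 0, 0
    while True:
        rate = gain if state == 0 else loss
        if rate <= 0:
            break  # no transition is possible from this state
        t += rng.expovariate(rate)
        if t >= length:
            break  # next event would fall beyond the end of the branch
        if state == 0:
            state, gains = 1, gains + 1
        else:
            state, losses = 0, losses + 1
    return state, gains, losses

def simulate_pattern(tree, root, gain, loss, rng):
    """Simulate one gene family on a tree.

    `tree` is a list of (child, parent, branch_length) tuples with every
    parent listed before its children. Assumes gain + loss > 0; the root
    state is drawn from the stationary frequency gain / (gain + loss).
    """
    states = {root: 1 if rng.random() < gain / (gain + loss) else 0}
    events = {}
    for child, parent, length in tree:
        states[child], g, l = simulate_branch(states[parent], length, gain, loss, rng)
        events[child] = (g, l)  # true gains and losses on the branch above `child`
    return states, events

# Hypothetical four-taxon tree ((A,B)N1,(C,D)N2)root with made-up branch lengths.
TREE = [("N1", "root", 0.3), ("A", "N1", 0.2), ("B", "N1", 0.2),
        ("N2", "root", 0.3), ("C", "N2", 0.2), ("D", "N2", 0.2)]
LEAVES = ["A", "B", "C", "D"]

rng = random.Random(1)
patterns = []
for _ in range(5):
    states, events = simulate_pattern(TREE, "root", gain=0.5, loss=1.0, rng=rng)
    # One phyletic pattern: presence/absence of the gene family in each leaf genome.
    patterns.append("".join(str(states[leaf]) for leaf in LEAVES))
print(patterns)
```

Because the per-branch event counts in `events` are known exactly, they can serve as the ground truth against which events inferred by maximum parsimony or stochastic mapping from the leaf patterns alone are scored as true or false positives.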