Cohen Ofir, Rubinstein Nimrod D, Stern Adi, Gophna Uri, Pupko Tal
Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel.
Philos Trans R Soc Lond B Biol Sci. 2008 Dec 27;363(1512):3903-11. doi: 10.1098/rstb.2008.0177.
Probabilistic evolutionary models have revolutionized our capability to extract biological insights from sequence data. While these models accurately describe the stochastic processes of site-specific substitutions, single-base substitutions represent only a fraction of all the events that shape genomes. Specifically, in microbes, events in which entire genes are gained (e.g. via horizontal gene transfer) or lost play a pivotal evolutionary role. In this research, we present a novel likelihood-based evolutionary model for gene gains and losses, and use it to analyse genome-wide patterns of the presence and absence of gene families. The model assumes a Markovian stochastic process along an underlying phylogenetic tree, in which a gain is a transition from absence to presence and a loss is a transition from presence to absence. To account for differences in the rates of gain and loss of different gene families, we assume among-gene family rate variability, thus allowing for a more accurate description of the data. Using a Bayesian approach, we estimated an evolutionary rate for each gene family. Simulation studies demonstrated that our methodology accurately infers these rates. Our methodology was applied to a large corpus of data, consisting of 4873 gene families spanning 63 species, and revealed novel insights regarding the evolutionary nature of genome-wide gain and loss dynamics.
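The kind of model the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a two-state continuous-time Markov chain (0 = absent, 1 = present) with a gain rate and a loss rate, a stationary distribution at the root, and a uniform prior over a small set of discrete rate categories. The abstract does not specify the root distribution or the form of the rate variability, so both are assumptions made here. All function names and the toy tree are hypothetical.

```python
import math

def transition_matrix(gain, loss, t):
    """P[i][j] = Pr(state j after time t | state i); 0 = absent, 1 = present."""
    total = gain + loss
    pi1 = gain / total            # stationary probability of presence
    pi0 = loss / total            # stationary probability of absence
    decay = math.exp(-total * t)
    return [
        [pi0 + pi1 * decay, pi1 - pi1 * decay],
        [pi0 - pi0 * decay, pi1 + pi0 * decay],
    ]

def partial_likelihoods(node, gain, loss):
    """Felsenstein pruning for one presence/absence pattern.

    A leaf is an int (observed 0 or 1); an internal node is a list of
    (child, branch_length) pairs. Returns the likelihood of the data
    below the node, conditional on the node being in state 0 and state 1.
    """
    if isinstance(node, int):
        return [1.0 if node == state else 0.0 for state in (0, 1)]
    result = [1.0, 1.0]
    for child, length in node:
        p = transition_matrix(gain, loss, length)
        below = partial_likelihoods(child, gain, loss)
        for state in (0, 1):
            result[state] *= p[state][0] * below[0] + p[state][1] * below[1]
    return result

def pattern_likelihood(tree, gain, loss, rate=1.0):
    """Likelihood of one gene family's pattern; `rate` scales both rates."""
    g, l = gain * rate, loss * rate
    root = partial_likelihoods(tree, g, l)
    pi1 = g / (g + l)             # assumption: stationary root distribution
    return (1.0 - pi1) * root[0] + pi1 * root[1]

def posterior_mean_rate(tree, gain, loss, rates):
    """Posterior mean rate for one family (assumed uniform prior over `rates`)."""
    likes = [pattern_likelihood(tree, gain, loss, r) for r in rates]
    return sum(r * lk for r, lk in zip(rates, likes)) / sum(likes)

# Toy tree ((A:0.1, B:0.1):0.2, C:0.3); the family is present in A and B, absent in C.
toy_tree = [([(1, 0.1), (1, 0.1)], 0.2), (0, 0.3)]
print(pattern_likelihood(toy_tree, gain=0.5, loss=1.0))
print(posterior_mean_rate(toy_tree, gain=0.5, loss=1.0, rates=[0.25, 1.0, 4.0]))
```

Scaling both rates by a family-specific factor is equivalent to stretching every branch of the tree for that family, which is one common way to express among-family rate variability. Averaging the rate categories under their posterior weights gives a per-family rate estimate in the spirit of the Bayesian step mentioned in the abstract.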