Origins Institute and Department of Physics and Astronomy, McMaster University, Hamilton, Ontario, Canada.
Mol Biol Evol. 2012 Nov;29(11):3413-25. doi: 10.1093/molbev/mss163. Epub 2012 Jun 29.
When groups of related bacterial genomes are compared, the number of core genes found in all genomes is usually much smaller than the mean genome size, whereas the size of the pangenome (the set of genes found in at least one of the genomes) is much larger than the mean size of one genome. We analyze 172 complete genomes of Bacilli and compare the properties of the pangenomes and core genomes of monophyletic subsets taken from this group. We then assess the capabilities of several evolutionary models to predict these properties. The infinitely many genes (IMG) model is based on the assumption that each new gene can arise only once. The predictions of the model depend on the shape of the evolutionary tree that underlies the divergence of the genomes. We calculate results for coalescent trees, star trees, and arbitrary phylogenetic trees with predefined, fixed branch lengths. On a star tree, the pangenome size increases linearly with the number of genomes, as has been suggested in some previous studies, whereas on a coalescent tree, it increases logarithmically. The coalescent tree gives a better fit to the data for all the examples we consider. In some cases, a fixed phylogenetic tree proved better than the coalescent tree at reproducing structure in the gene frequency spectrum, but little improvement was gained in predictions of the core and pangenome sizes. Most of the data are well explained by a model with three classes of genes: an essential class that is found in all genomes, a slow class whose rates of origination and deletion are slow compared with the time of divergence of the genomes, and a fast class showing rapid origination and deletion. Although the majority of genes originating in a genome are in the fast class, these genes are not retained for long periods, and the majority of genes present in a genome are in the slow or essential classes.
In general, we show that the IMG model is useful for comparison with experimental genome data, both at the species level and for widely divergent taxonomic groups. Software implementing the described formulae is provided at http://github.com/rec3141/pangenome.
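The contrast between linear and logarithmic pangenome growth can be illustrated with a minimal sketch. It is not the software linked above, and every name and parametrization in it is an assumption made for illustration: `u` is a gene origination rate per genome, `v` a deletion rate per gene, `t` the root-to-tip branch length of a star tree, and `theta` and `rho` are scaled origination and deletion rates on a coalescent tree. The star-tree expression follows from assuming a root genome at its stationary size `u / v` and independent gain and loss along each branch. The coalescent expression is a closed form commonly quoted for the IMG model, and its scaling may differ from the one used in the paper.

```python
import math


def star_pangenome(n, u, v, t):
    """Expected pangenome size of n genomes on a star tree (illustrative).

    Assumes the root genome holds u / v genes (the stationary mean) and
    that each of the n branches has length t.
    """
    g0 = u / v                      # stationary mean genome size
    keep = math.exp(-v * t)         # chance an ancestral gene survives one branch
    # Ancestral genes still present in at least one of the n genomes.
    ancestral = g0 * (1.0 - (1.0 - keep) ** n)
    # Genes that arose on a branch and survived to its tip; each is unique
    # to one genome because a gene can arise only once.
    new = n * g0 * (1.0 - keep)
    return ancestral + new


def coalescent_pangenome(n, theta, rho):
    """Expected pangenome size of n genomes on a coalescent tree (illustrative).

    Uses the assumed closed form theta * sum_{i=0}^{n-1} 1 / (rho + i).
    """
    return theta * sum(1.0 / (rho + i) for i in range(n))


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the two shapes.
    for n in (1, 2, 5, 10, 50, 100):
        star = star_pangenome(n, u=1000.0, v=1.0, t=0.2)
        coal = coalescent_pangenome(n, theta=1000.0, rho=1.0)
        print(f"n={n:4d}  star={star:10.1f}  coalescent={coal:10.1f}")
```

In this sketch, each additional genome on the star tree eventually adds a constant `(u / v) * (1 - exp(-v * t))` genes, while on the coalescent tree the n-th genome adds only `theta / (rho + n - 1)`, so the total grows roughly as `theta * log(n)`.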