Iwasaki Wataru, Takagi Toshihisa
Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8568, Japan.
Bioinformatics. 2007 Jul 1;23(13):i230-9. doi: 10.1093/bioinformatics/btm165.
Reconstruction of gene-content evolutionary history is fundamental in studying the evolution of genomes and biological systems. To reconstruct a plausible evolutionary history, rates of gene gain and loss should be estimated in a way that accounts for their high level of heterogeneity: for example, genome duplication results in high rates of gene gain, whereas parasitization results in high rates of gene loss. No method for reconstructing gene-content evolution had yet been developed that considers this heterogeneity, estimates the rates of gene gain and loss effectively, and is efficient enough to analyze abundant genomic data.
An effective and efficient method for reconstructing heterogeneous gene-content evolution was developed. This method comprises analytically integrable modeling of gene-content evolution, an analytical formulation of expectation-maximization, and efficient calculation of the marginal likelihood using an inside-outside-like algorithm. Simulation tests on the scale of hundreds of genomes showed that both the gene gain/loss rates and the evolutionary history were effectively estimated within a few days of computational time. Subsequently, this algorithm was applied to an actual data set of nearly 200 genomes to reconstruct the heterogeneous gene-content evolution across the three domains of life. The reconstructed history, which contained several features consistent with biological observations, showed that the trends of gene-content evolution not only differed drastically between prokaryotes and eukaryotes, but were also highly variable within each form of life. The results suggest that heterogeneity should be considered in studies of the evolution of gene content, genomes and biological systems.
An R script that implements the algorithm is available upon request.
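To make the idea of a marginal likelihood under heterogeneous gain/loss rates concrete, the following Python sketch may help. It is not the authors' R script or their algorithm: the abstract does not specify the model, so the sketch assumes a standard two-state (absent/present) continuous-time Markov chain with a separate gain and loss rate on every branch, and it uses Felsenstein-style pruning, which corresponds only to an upward ("inside"-like) pass. It performs no expectation-maximization and no outside pass. The tree layout, the function names and all rate values are hypothetical and chosen for illustration only.

```python
import math


def transition_matrix(gain, loss, t):
    """Return P where P[i][j] = P(state j at branch end | state i at start).

    States: 0 = gene absent, 1 = gene present. Assumes gain + loss > 0.
    """
    total = gain + loss
    decay = math.exp(-total * t)
    p01 = gain / total * (1.0 - decay)  # absent -> present
    p10 = loss / total * (1.0 - decay)  # present -> absent
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]


def conditional_likelihood(node, pattern):
    """Return [L0, L1]: probability of the observed leaves below `node`,
    given that the gene is absent (0) or present (1) at `node`."""
    if "children" not in node:  # leaf: the observation fixes the state
        observed = pattern[node["name"]]
        return [1.0 if observed == 0 else 0.0,
                1.0 if observed == 1 else 0.0]
    result = [1.0, 1.0]
    for child in node["children"]:
        # Each branch carries its own rates (the assumed heterogeneity).
        p = transition_matrix(child["gain"], child["loss"], child["length"])
        below = conditional_likelihood(child, pattern)
        for state in (0, 1):
            result[state] *= p[state][0] * below[0] + p[state][1] * below[1]
    return result


def marginal_likelihood(tree, pattern, root_prior=(0.5, 0.5)):
    """Sum the root conditional likelihoods over an assumed root prior."""
    l0, l1 = conditional_likelihood(tree, pattern)
    return root_prior[0] * l0 + root_prior[1] * l1


def example_tree(loss_c=0.1):
    """Hypothetical three-genome tree ((B, C), A); `loss_c` is the loss
    rate on the branch leading to genome C (e.g. a parasitic lineage)."""
    return {"children": [
        {"name": "A", "gain": 0.1, "loss": 0.1, "length": 1.0},
        {"gain": 0.1, "loss": 0.1, "length": 0.5, "children": [
            {"name": "B", "gain": 0.1, "loss": 0.1, "length": 0.5},
            {"name": "C", "gain": 0.1, "loss": loss_c, "length": 0.5},
        ]},
    ]}


if __name__ == "__main__":
    # Gene family present in A and B but absent from C.
    pattern = {"A": 1, "B": 1, "C": 0}
    for loss_c in (0.1, 2.0):
        print(loss_c, marginal_likelihood(example_tree(loss_c), pattern))
```

In this toy setting, raising the loss rate on the branch leading to C makes a pattern in which C lacks the gene more likely, which is the kind of lineage-specific signal that a heterogeneous model can exploit. Estimating the branch rates themselves would require an optimization step such as the expectation-maximization procedure mentioned in the abstract, which this sketch does not attempt.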