Bansal Mukul S, Alm Eric J, Kellis Manolis
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts.
J Comput Biol. 2013 Oct;20(10):738-54. doi: 10.1089/cmb.2013.0073. Epub 2013 Sep 14.
Phylogenetic tree reconciliation is a powerful approach for inferring evolutionary events like gene duplication, horizontal gene transfer, and gene loss, which are fundamental to our understanding of molecular evolution. While duplication-loss (DL) reconciliation leads to a unique maximum-parsimony solution, duplication-transfer-loss (DTL) reconciliation yields a multitude of optimal solutions, making it difficult to infer the true evolutionary history of the gene family. This problem is further exacerbated by the fact that different event cost assignments yield different sets of optimal reconciliations. Here, we present an effective, efficient, and scalable method for dealing with these fundamental problems in DTL reconciliation. Our approach works by sampling the space of optimal reconciliations uniformly at random and aggregating the results. We show that even gene trees with only a few dozen genes often have millions of optimal reconciliations and present an algorithm to efficiently sample the space of optimal reconciliations uniformly at random in O(mn^2) time per sample, where m and n denote the number of genes and species, respectively. We use these samples to understand how different optimal reconciliations vary in their node mappings and event assignments and to investigate the impact of varying event costs. We apply our method to a biological dataset of approximately 4700 gene trees from 100 taxa and observe that 93% of event assignments and 73% of mappings remain consistent across multiple optima. Our analysis represents the first systematic investigation of the space of optimal DTL reconciliations and has many important implications for the study of gene family evolution.
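The idea of sampling optimal solutions uniformly at random and aggregating the samples can be illustrated on a much smaller problem. The sketch below is not the authors' DTL algorithm. It assumes a hypothetical toy DAG (`GRAPH`) in which each source-to-sink path stands in for one candidate solution, and it uses a standard count-then-sample dynamic program: first count how many minimum-cost completions each node has, then pick each next step with probability proportional to that count. The names `best`, `sample_optimal_path`, and `node_support` are illustrative only.

```python
import random
from collections import Counter
from functools import lru_cache

# Hypothetical toy DAG: node -> list of (successor, edge cost).
# Each s-to-t path plays the role of one candidate solution.
GRAPH = {
    "s": [("a", 1), ("b", 2)],
    "a": [("c", 1), ("d", 2)],
    "b": [("c", 1), ("d", 1)],
    "c": [("t", 2)],
    "d": [("t", 1)],
    "t": [],
}
SINK = "t"


@lru_cache(maxsize=None)
def best(node):
    """Return (minimum cost from node to SINK, number of paths achieving it)."""
    if node == SINK:
        return 0, 1
    min_cost, count = float("inf"), 0
    for nxt, weight in GRAPH[node]:
        sub_cost, sub_count = best(nxt)
        total = weight + sub_cost
        if total < min_cost:
            min_cost, count = total, sub_count
        elif total == min_cost:
            count += sub_count
    return min_cost, count


def sample_optimal_path(rng, start="s"):
    """Draw one minimum-cost path uniformly at random among all optima."""
    path, node = [start], start
    while node != SINK:
        cost_here, _ = best(node)
        # Keep only successors that lie on some optimal path from this node.
        optimal = [nxt for nxt, w in GRAPH[node] if w + best(nxt)[0] == cost_here]
        # Weight each choice by how many optimal completions it leads to.
        weights = [best(nxt)[1] for nxt in optimal]
        node = rng.choices(optimal, weights=weights, k=1)[0]
        path.append(node)
    return path


def node_support(paths):
    """Fraction of sampled optimal paths that pass through each node."""
    hits = Counter(node for path in paths for node in set(path))
    return {node: hits[node] / len(paths) for node in hits}


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [sample_optimal_path(rng) for _ in range(3000)]
    print("optimal cost, number of optima:", best("s"))
    print("support per node:", node_support(samples))
```

In this toy graph there are three minimum-cost paths, and weighting each step by the number of optimal completions is what makes the draw uniform over whole paths rather than over local choices. The per-node support is a loose analogue of asking how consistent a mapping or event assignment is across optima.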