Bayzid Md Shamsuzzoha, Warnow Tandy
1Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh.
2Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, USA.
Algorithms Mol Biol. 2018 Jan 19;13:1. doi: 10.1186/s13015-017-0120-1. eCollection 2018.
Species tree estimation from gene trees can be complicated by gene duplication and loss, and "gene tree parsimony" (GTP) is one approach for estimating species trees from multiple gene trees. In its standard formulation, the objective is to find a species tree that minimizes the total number of gene duplications and losses with respect to the input set of gene trees. Although much is known about GTP, little is known about how to treat inputs containing some (i.e., gene trees lacking one or more of the species).
We present new theory for GTP considering whether the incompleteness is due to gene birth and death (i.e., true biological loss) or taxon sampling, and present dynamic programming algorithms that can be used for an exact but exponential time solution for small numbers of taxa, or as a heuristic for larger numbers of taxa. We also prove that the "standard" calculations for duplications and losses exactly solve GTP when incompleteness results from taxon sampling, although they can be incorrect when incompleteness results from true biological loss. The software for the DP algorithm is freely available as open source code at https://github.com/smirarab/DynaDup.
从基因树估计物种树可能会因基因复制和丢失而变得复杂,“基因树简约法”(GTP)是一种从多个基因树估计物种树的方法。在其标准形式中,目标是找到一棵物种树,相对于输入的基因树集合,使基因复制和丢失的总数最小化。尽管对GTP已经了解很多,但对于如何处理包含一些缺失物种(即缺少一个或多个物种的基因树)的输入却知之甚少。
我们提出了关于GTP的新理论,考虑不完整性是由于基因产生和死亡(即真正的生物学丢失)还是分类群抽样导致的,并提出了动态规划算法,该算法可用于少量分类群的精确但指数时间的解决方案,或作为大量分类群的启发式方法。我们还证明,当不完整性是由分类群抽样导致时,复制和丢失的“标准”计算能准确求解GTP,尽管当不完整性是由真正的生物学丢失导致时它们可能不正确。用于动态规划算法的软件可作为开源代码在https://github.com/smirarab/DynaDup上免费获取。