Song Yun S, Lyngsø Rune, Hein Jotun
Department of Computer Science, University of California at Davis, 2063 Kemper Hall, One Shields Avenue, Davis, CA 95616, USA.
IEEE/ACM Trans Comput Biol Bioinform. 2006 Jul-Sep;3(3):239-51. doi: 10.1109/TCBB.2006.31.
Given a set D of input sequences, a genealogy for D can be constructed backward in time using such evolutionary events as mutation, coalescent, and recombination. An ancestral configuration (AC) can be regarded as the multiset of all sequences present at a particular point in time in a possible genealogy for D. The complexity of computing the likelihood of observing D depends heavily on the total number of distinct ACs of D and, therefore, it is of interest to estimate that number. For D consisting of binary sequences of finite length, we consider the problem of enumerating exactly all distinct ACs. We assume that the root sequence type is known and that the mutation process is governed by the infinite-sites model. When there is no recombination, we construct a general method of obtaining closed-form formulas for the total number of ACs. The enumeration problem becomes much more complicated when recombination is involved. In that case, we devise a method of enumeration based on counting contingency tables and construct a dynamic programming algorithm for the approach. Last, we describe a method of counting the number of ACs that can appear in genealogies with less than or equal to a given number R of recombinations. Of particular interest is the case in which R is close to the minimum number of recombinations for D.
给定一组输入序列D,可以利用诸如突变、合并和重组等进化事件,逆时间构建D的系谱。祖先配置(AC)可被视为在D的可能系谱中特定时间点出现的所有序列的多重集。计算观察到D的似然性的复杂度在很大程度上取决于D的不同AC的总数,因此,估计这个数量是很有意义的。对于由有限长度的二进制序列组成的D,我们考虑精确枚举所有不同AC的问题。我们假设根序列类型已知,且突变过程由无限位点模型控制。当不存在重组时,我们构建了一种获得AC总数的封闭形式公式的通用方法。当涉及重组时,枚举问题变得更加复杂。在这种情况下,我们设计了一种基于列联表计数的枚举方法,并为该方法构建了一个动态规划算法。最后,我们描述了一种计算在重组次数小于或等于给定数量R的系谱中可能出现的AC数量的方法。特别值得关注的是R接近D的最小重组次数的情况。