Ferretti V, Lang B F, Sankoff D
Université de Montréal, CRM, Montreal, Quebec, Canada.
J Comput Biol. 1994 Spring;1(1):77-92. doi: 10.1089/cmb.1994.1.77.
Evolutionary inference methods that assume equal DNA base compositions and symmetric nucleotide substitution matrices are likely, where these assumptions do not hold, to group species on the basis of similar base compositions rather than true phylogenetic relationships. We propose an invariants-based method for dealing with this problem. An invariant Q_T of a tree T under a k-state Markov model, where a generalized time parameter is identified with each of the E edges of T, allows us to recognize whether data on N observed species can be associated with the N terminal vertices of T in the sense of having been generated on T rather than on any other tree with N terminals. The generalized time parameter takes the form of a positive-determinant matrix in some semigroup S of stochastic matrices. The invariance is with respect to the choice of the set of E matrices in S, one associated with each of the E edges of T. We apply a general "empirical" method of finding invariants of a parametrized functional form. It involves calculating the probability f of all k^N data possibilities for each of m sets of E matrices in S to associate with the edges of T, then solving for the parameters of Q using the m equations of form Q(f) = 0. We discuss the problems of finding asymmetric models satisfying the property of semigroup closure, of finding asymmetric models that admit invariants at all, and of the computational complexity of the method. We propose a class of semigroups S_c containing matrices of form [formula: see text] to account for A+T versus G+C asymmetries in DNA base composition. Quadratic invariants are obtained for rooted trees with three and with four terminals. In the latter case the smallest set of algebraically independent invariants is sought. These invariants are applied to data pertaining to fungal evolution and to the origin of mitochondria as bacterial endosymbionts.
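The first step of the "empirical" method, computing the probability f of all k^N data possibilities for each of m sets of edge matrices, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it fixes a rooted tree ((A,B),C) with three terminals and E = 4 edges, uses generic random row-stochastic matrices as stand-ins for members of the semigroup S (the S_c form is not reproduced here, and the positive-determinant condition is not enforced), and assumes a uniform root distribution. The function names are hypothetical. The second step, solving the m equations Q(f) = 0 for the coefficients of Q, is omitted.

```python
import itertools
import random


def random_stochastic_matrix(k, rng):
    """Return a k x k row-stochastic matrix (each row sums to 1).

    Assumption: a generic stochastic matrix stands in for a member of S.
    """
    rows = []
    for _ in range(k):
        weights = [rng.random() + 0.1 for _ in range(k)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def pattern_probabilities(root_dist, m_inner, m_a, m_b, m_c):
    """Probability of every pattern (a, b, c) at the terminals of ((A,B),C).

    m_inner: root -> internal vertex (ancestor of A and B)
    m_a, m_b: internal vertex -> terminals A and B
    m_c: root -> terminal C
    """
    k = len(root_dist)
    f = {}
    for a, b, c in itertools.product(range(k), repeat=3):
        total = 0.0
        for r in range(k):      # unobserved state at the root
            for v in range(k):  # unobserved state at the internal vertex
                total += (root_dist[r] * m_inner[r][v]
                          * m_a[v][a] * m_b[v][b] * m_c[r][c])
        f[(a, b, c)] = total
    return f


def sample_pattern_vectors(k, m, seed=0):
    """Compute f for m independently drawn sets of E = 4 edge matrices."""
    rng = random.Random(seed)
    root_dist = [1.0 / k] * k  # assumption: uniform root distribution
    vectors = []
    for _ in range(m):
        edges = [random_stochastic_matrix(k, rng) for _ in range(4)]
        vectors.append(pattern_probabilities(root_dist, *edges))
    return vectors


if __name__ == "__main__":
    # k = 4 nucleotide states, N = 3 terminals: 4**3 = 64 patterns per vector.
    for f in sample_pattern_vectors(k=4, m=3, seed=1):
        print(len(f), round(sum(f.values()), 12))
```

Each of the m vectors f would then contribute one equation Q(f) = 0. For a quadratic Q this equation is linear in the unknown coefficients of Q, so the coefficients could in principle be recovered from the null space of the resulting linear system.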