Long Colby, Kubatko Laura
Department of Mathematical and Computational Sciences, College of Wooster, Wooster, OH, United States.
Department of Statistics and Evolution, Ecology, and Organismal Biology, The Ohio State University, Columbus, OH, United States.
Front Genet. 2021 Jul 2;12:664357. doi: 10.3389/fgene.2021.664357. eCollection 2021.
A phylogenetic model of sequence evolution for a set of n taxa is a collection of probability distributions on the 4^n possible site patterns that may be observed in their aligned DNA sequences. For a four-taxon model, one can arrange the entries of these probability distributions into three flattening matrices that correspond to the three different unrooted leaf-labeled four-leaf trees, or quartet trees. The flattening matrix corresponding to the tree parameter of the model is known to satisfy certain rank conditions. Methods such as ErikSVD and SVDQuartets take advantage of this observation by applying singular value decomposition to flattening matrices built from empirical data. Each possible quartet is assigned an "SVD score" based on how close its flattening is to the set of matrices of the predicted rank. When choosing among possible quartets, the one with the lowest score is inferred to be the phylogeny of the four taxa under consideration. Since an n-leaf phylogenetic tree is determined by its quartets, this approach can be generalized to infer larger phylogenies. In this article, we explore using the SVD score as a test statistic to test whether phylogenetic data were generated by a particular quartet tree. To do so, we use several results to approximate the distribution of the SVD score and to give upper bounds on the p-value of the associated hypothesis tests. We also apply these hypothesis tests to simulated phylogenetic data and discuss the implications for interpreting SVD scores in rank-based inference methods.
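The scoring step described in the abstract can be illustrated with a minimal sketch. It assumes four aligned sequences over the alphabet A, C, G, T, builds the 16 x 16 flattening of relative site-pattern frequencies for each of the three quartet splits, and measures the Frobenius distance from each flattening to the nearest matrix of a given rank using the trailing singular values. The function names (`flattening`, `svd_score`, `score_quartets`) are hypothetical, and the default `rank=10` is an assumption that is not stated in the abstract; the appropriate rank depends on the model and method under consideration.

```python
import numpy as np

BASES = "ACGT"
INDEX = {base: i for i, base in enumerate(BASES)}


def flattening(seqs, split):
    """Build the 16 x 16 flattening for the split ((a, b), (c, d)).

    seqs maps each taxon name to its aligned sequence. Rows are indexed by
    the pair of states at taxa (a, b), columns by the pair at taxa (c, d).
    Sites containing anything other than A, C, G, T are skipped
    (an assumption made for this sketch).
    """
    (a, b), (c, d) = split
    counts = np.zeros((16, 16))
    for xa, xb, xc, xd in zip(seqs[a], seqs[b], seqs[c], seqs[d]):
        if all(x in INDEX for x in (xa, xb, xc, xd)):
            row = 4 * INDEX[xa] + INDEX[xb]
            col = 4 * INDEX[xc] + INDEX[xd]
            counts[row, col] += 1
    total = counts.sum()
    # Convert counts to relative frequencies when any sites were usable.
    return counts / total if total > 0 else counts


def svd_score(matrix, rank=10):
    """Frobenius distance from matrix to the nearest matrix of the given rank.

    By the Eckart-Young theorem this is the square root of the sum of the
    squared singular values beyond the first `rank`. The default rank is an
    assumption of this sketch, not a value taken from the abstract.
    """
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return float(np.sqrt(np.sum(singular_values[rank:] ** 2)))


def score_quartets(seqs, rank=10):
    """Return the SVD score of each of the three quartet splits."""
    a, b, c, d = sorted(seqs)
    splits = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return {
        f"{p[0]},{p[1]}|{q[0]},{q[1]}": svd_score(flattening(seqs, (p, q)), rank)
        for p, q in splits
    }
```

Under this sketch, the split with the smallest value returned by `score_quartets` would be the inferred quartet, mirroring the selection rule described above.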