Nye Tom M W, Tang Xiaoxian, Weyenberg Grady, Yoshida Ruriko
School of Mathematics and Statistics, Newcastle University, Newcastle upon Tyne NE1 7RU,
Department of Mathematics, Texas A&M University, College Station, Texas 77843,
Biometrika. 2017 Dec;104(4):901-922. doi: 10.1093/biomet/asx047. Epub 2017 Sep 27.
Evolutionary relationships are represented by phylogenetic trees, and a phylogenetic analysis of gene sequences typically produces a collection of these trees, one for each gene in the analysis. Analysis of samples of trees is difficult due to the multi-dimensionality of the space of possible trees. In Euclidean spaces, principal component analysis is a popular method of reducing high-dimensional data to a low-dimensional representation that preserves much of the sample's structure. However, the space of all phylogenetic trees on a fixed set of species does not form a Euclidean vector space, and methods adapted to tree space are needed. Previous work introduced the notion of a principal geodesic in this space, analogous to the first principal component. Here we propose a geometric object for tree space similar to the kth principal component in Euclidean space: the locus of the weighted Fréchet mean of k+1 vertex trees when the weights vary over the k-simplex. We establish some basic properties of these objects, in particular showing that they have dimension k, and propose algorithms for projection onto these surfaces and for finding the principal locus associated with a sample of trees. Simulation studies demonstrate that these algorithms perform well, and analyses of two datasets, containing Apicomplexa and African coelacanth genomes respectively, reveal important structure from the second principal components.
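The central construction in the abstract, the locus of a weighted Fréchet mean as the weights range over a simplex, can be illustrated in the familiar Euclidean setting. The sketch below is only that Euclidean analogue and is not the authors' tree-space algorithm: in Euclidean space the weighted Fréchet mean (the minimiser of the weighted sum of squared distances to the vertices) reduces to the weighted average, so the locus of three vertices is a filled triangle, a 2-dimensional object. All function names (`weighted_frechet_mean`, `frechet_objective`, `sample_simplex`, `locus_points`) and the example vertices are hypothetical, chosen for illustration.

```python
import random


def weighted_frechet_mean(vertices, weights):
    """Euclidean weighted Frechet mean.

    Minimises sum_i w_i * |x - v_i|^2, which in Euclidean space is
    simply the weighted average of the vertices. (Assumption: this
    closed form is specific to the Euclidean case shown here.)
    """
    total = sum(weights)
    dim = len(vertices[0])
    return [
        sum(w * v[j] for w, v in zip(weights, vertices)) / total
        for j in range(dim)
    ]


def frechet_objective(x, vertices, weights):
    """Weighted sum of squared Euclidean distances from x to the vertices."""
    return sum(
        w * sum((xj - vj) ** 2 for xj, vj in zip(x, v))
        for w, v in zip(weights, vertices)
    )


def sample_simplex(k, rng):
    """Draw k + 1 non-negative weights summing to 1 (a point of the k-simplex)."""
    draws = [rng.expovariate(1.0) for _ in range(k + 1)]
    total = sum(draws)
    return [d / total for d in draws]


def locus_points(vertices, n, seed=0):
    """Sample n points of the locus by varying the weights over the simplex."""
    rng = random.Random(seed)
    k = len(vertices) - 1
    return [
        weighted_frechet_mean(vertices, sample_simplex(k, rng))
        for _ in range(n)
    ]


# Hypothetical example: k = 2, three vertices in R^3 lying in the plane z = 0.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
points = locus_points(vertices, 5)
```

Every sampled point lies in the triangle spanned by the three vertices, which is the Euclidean counterpart of the statement that the locus for k+1 vertices has dimension k. In tree space the weighted average is not available, so this sketch should be read as motivation for the definition rather than as a way to compute it.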