Principal component analysis and the locus of the Fréchet mean in the space of phylogenetic trees.

Authors

Nye Tom M W, Tang Xiaoxian, Weyenberg Grady, Yoshida Ruriko

Affiliations

School of Mathematics and Statistics, Newcastle University, Newcastle upon Tyne NE1 7RU,

Department of Mathematics, Texas A&M University, College Station, Texas 77843,

Publication information

Biometrika. 2017 Dec;104(4):901-922. doi: 10.1093/biomet/asx047. Epub 2017 Sep 27.

Abstract

Evolutionary relationships are represented by phylogenetic trees, and a phylogenetic analysis of gene sequences typically produces a collection of these trees, one for each gene in the analysis. Analysis of samples of trees is difficult due to the multi-dimensionality of the space of possible trees. In Euclidean spaces, principal component analysis is a popular method of reducing high-dimensional data to a low-dimensional representation that preserves much of the sample's structure. However, the space of all phylogenetic trees on a fixed set of species does not form a Euclidean vector space, and methods adapted to tree space are needed. Previous work introduced the notion of a principal geodesic in this space, analogous to the first principal component. Here we propose a geometric object for tree space similar to the $k$th principal component in Euclidean space: the locus of the weighted Fréchet mean of $k+1$ vertex trees when the weights vary over the $k$-simplex. We establish some basic properties of these objects, in particular showing that they have dimension $k$, and propose algorithms for projection onto these surfaces and for finding the principal locus associated with a sample of trees. Simulation studies demonstrate that these algorithms perform well, and analyses of two datasets, containing Apicomplexa and African coelacanth genomes respectively, reveal important structure from the second principal components.
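
For readers unfamiliar with the central object, the locus described in the abstract can be written out as follows. This is a sketch in assumed notation, not the paper's own: $k$ denotes the component index, $\mathcal{T}$ the tree space, $d$ its geodesic metric (the abstract does not name the metric), $v_0, \dots, v_k$ the vertex trees, and $\Delta^k$ the simplex of weights.

```latex
% Assumed notation (not taken from the abstract): \mathcal{T}, d, v_i, w_i, \mu, \Pi.
\[
  \Delta^k = \Bigl\{ w = (w_0, \dots, w_k) : w_i \ge 0,\ \sum_{i=0}^{k} w_i = 1 \Bigr\}
\]
% Weighted Frechet mean of the vertex trees for a fixed weight vector w:
\[
  \mu(w) = \operatorname*{arg\,min}_{x \in \mathcal{T}} \; \sum_{i=0}^{k} w_i \, d(x, v_i)^2
\]
% Locus of the weighted Frechet mean as the weights range over the simplex:
\[
  \Pi(v_0, \dots, v_k) = \bigl\{ \mu(w) : w \in \Delta^k \bigr\}
\]
% Euclidean special case, d(x, y) = \lVert x - y \rVert:
\[
  \mu(w) = \sum_{i=0}^{k} w_i \, v_i
\]
```

In the Euclidean special case the minimiser is the weighted average, so the locus is the convex hull of the $k+1$ vertices, which is $k$-dimensional when the vertices are affinely independent. That is the sense in which the locus plays the role of a $k$th principal component; in tree space the mean generally has no such closed form and must be computed numerically.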

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/24cd/5793493/a6dc3996ac83/asx047f1.jpg
