Chakerian John, Holmes Susan
Palantir Technologies.
Stanford University, Stanford, CA 94305.
J Comput Graph Stat. 2012;21(3):581-599. doi: 10.1080/10618600.2012.640901. Epub 2012 Aug 16.
Inferential summaries of tree estimates are useful in evolutionary biology, where phylogenetic trees have been built from DNA data since the 1960s. In bioinformatics, psychometrics, and data mining, hierarchical clustering techniques output the same mathematical objects, and practitioners have similar questions about the stability and "generalizability" of these summaries. This article describes the implementation of the geometric distance between trees developed by Billera, Holmes, and Vogtmann (2001), which applies equally to phylogenetic trees and hierarchical clustering trees, and shows some of its applications in evaluating tree estimates. In particular, since Billera et al. (2001) have shown that the space of trees is nonpositively curved (a CAT(0) space), a collection of trees can naturally be represented as a tree. We compare this representation to the Euclidean approximations of treespace obtained by applying both classical multidimensional scaling and kernel multidimensional scaling to the matrix of distances between trees. We also provide applications of the distances between trees to hierarchical clustering trees constructed from microarrays. Our method gives a new way of evaluating the influence of both certain columns (positions, variables, or genes) and certain rows (species, observations, or arrays) on the construction of such trees. It can also provide a way of detecting heterogeneous mixtures in the input data. Supplementary materials for this article are available online.
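As an illustration of the Euclidean approximation mentioned in the abstract, the following is a minimal sketch of classical multidimensional scaling applied to a matrix of pairwise distances. It is not the authors' implementation, and it does not compute the Billera, Holmes, and Vogtmann distance itself. The function name `classical_mds` and the toy matrix `toy` are hypothetical; `toy` is a stand-in for a real tree-to-tree distance matrix and is built from four points on a line so that the result is easy to check.

```python
import numpy as np


def classical_mds(dist, k=2):
    """Embed n objects in k dimensions from an n x n distance matrix.

    Hypothetical helper: textbook classical (Torgerson) MDS.
    """
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Centering matrix J = I - (1/n) * 11^T
    j = np.eye(n) - np.ones((n, n)) / n
    # Double-centered matrix B = -1/2 * J * D^2 * J (D^2 is elementwise)
    b = -0.5 * j @ (d ** 2) @ j
    # B is symmetric, so eigh applies; it returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:k]
    # Non-Euclidean distances can yield negative eigenvalues; clip them to zero
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy stand-in for a tree distance matrix (assumption: four objects placed
# at positions 0, 1, 3, 6 on a line, with absolute differences as distances).
positions = np.array([0.0, 1.0, 3.0, 6.0])
toy = np.abs(np.subtract.outer(positions, positions))

coords = classical_mds(toy, k=2)
print(coords)
```

Because the toy distances are exactly Euclidean, the pairwise distances between the rows of `coords` reproduce `toy` up to rounding. Distances in a CAT(0) treespace are generally not Euclidean, so negative eigenvalues can appear; the sketch clips them to zero, which is one reason the embedding is only an approximation.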