Colorado State University, Fort Collins, CO, USA.
Methods. 2018 Jan 1;132:26-33. doi: 10.1016/j.ymeth.2017.09.006. Epub 2017 Sep 15.
This paper presents several geometrically motivated techniques for visualizing high-dimensional biological data sets. The Grassmann manifold provides a robust framework for measuring the similarity of data in a subspace context. Sparse radial basis function classification, used as a visualization technique, leverages recent advances in radial basis function learning via convex optimization. In the spirit of deep belief networks, supervised centroid-encoding is proposed as a way to exploit class label information. These methods are compared with linear principal component analysis and its nonlinear counterpart (autoencoders) in the context of data visualization; those baseline approaches may perform poorly for visualization when the variance of the data is spread across more than three dimensions. In contrast, the proposed methods are shown to capture significant data structure in two or three dimensions, even when the information in the data lives in higher-dimensional subspaces. To illustrate these ideas, the visualization techniques are applied to gene expression data sets that capture the host immune system's response to Ebola virus infection in non-human primates and Collaborative Cross mice.
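As a rough illustration of what measuring similarity "in a subspace context" can mean, the sketch below computes the principal angles between two subspaces and combines them into a geodesic distance on the Grassmann manifold. This is a common textbook construction, not necessarily the exact metric used in the paper; the function names `principal_angles` and `grassmann_geodesic_distance` are hypothetical, and the inputs are assumed to be NumPy arrays with full column rank whose columns span the subspaces being compared.

```python
import numpy as np


def principal_angles(A, B):
    """Principal angles (in radians) between the column spans of A and B.

    Assumes A and B have the same number of rows and full column rank.
    """
    # Orthonormal bases for each subspace (reduced QR).
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    # Clip to guard against round-off pushing values slightly outside [-1, 1].
    return np.arccos(np.clip(cosines, -1.0, 1.0))


def grassmann_geodesic_distance(A, B):
    """Geodesic distance: the 2-norm of the vector of principal angles."""
    return np.linalg.norm(principal_angles(A, B))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two different bases for the same 3-dimensional subspace of R^50.
    A = rng.standard_normal((50, 3))
    B = A @ rng.standard_normal((3, 3))
    print(grassmann_geodesic_distance(A, B))  # close to zero

    # Two mutually orthogonal planes in R^4: both angles equal pi/2.
    E = np.eye(4)
    print(principal_angles(E[:, :2], E[:, 2:]))
```

Because the distance depends only on the spans, a set of samples (for example, expression profiles from one subject or time window) can be summarized by a basis matrix and compared with another set regardless of how the individual basis vectors were chosen.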