Gregor Mendel Institute, Vienna, Austria.
PLoS One. 2013;8(2):e56883. doi: 10.1371/journal.pone.0056883. Epub 2013 Feb 15.
Single nucleotide polymorphisms (SNPs) are one of the largest sources of new data in biology. In most papers, SNP differences between individuals are visualized with Principal Component Analysis (PCA), an older method for this purpose.
We compare PCA with a newer method, t-Distributed Stochastic Neighbor Embedding (t-SNE), for the visualization of large SNP datasets. We also propose a set of key metrics for evaluating these visualizations; t-SNE performs better on all of them.
PCA remains a reasonably good method for transforming data, but for visualization it should be replaced by a method from the subfield of dimension reduction. To evaluate visualization performance, we propose metrics based on cross-validation with machine learning methods, as well as cluster validity indices.
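The two kinds of evaluation metric named above can be illustrated with a minimal sketch. The abstract does not say which classifier or which validity index is used, so the choices here are assumptions made for illustration only: leave-one-out cross-validation with a 1-nearest-neighbour classifier, and the mean silhouette width as the cluster validity index. The function names and the synthetic 2D "embeddings" (`toy_embedding`) are hypothetical and stand in for real PCA or t-SNE coordinates labelled by population.

```python
import math
import random


def loo_1nn_accuracy(points, labels):
    """Leave-one-out cross-validated accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, p in enumerate(points):
        # The nearest *other* point in the 2D embedding predicts the label of point i.
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(points)


def silhouette(points, labels):
    """Mean silhouette width, one common cluster validity index (range -1 to 1)."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from point i to every other point, grouped by label.
        by_label = {}
        for j, q in enumerate(points):
            if j != i:
                by_label.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_label.pop(labels[i], None)
        if not own or not by_label:
            # Singleton group or only one group present: score 0 by convention.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                # mean distance within own group
        b = min(sum(d) / len(d) for d in by_label.values())    # mean distance to closest other group
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def toy_embedding(centres, spread, n_per_group=30, seed=0):
    """Hypothetical stand-in for a 2D visualization: Gaussian blobs, one per population."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (cx, cy) in enumerate(centres):
        for _ in range(n_per_group):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(label)
    return points, labels


if __name__ == "__main__":
    cases = [
        ("separated", [(0, 0), (10, 0), (0, 10)], 0.5),
        ("overlapping", [(0, 0), (1, 0), (0, 1)], 2.0),
    ]
    for name, centres, spread in cases:
        pts, labs = toy_embedding(centres, spread)
        print(name, loo_1nn_accuracy(pts, labs), silhouette(pts, labs))
```

Under these assumptions, a visualization that separates the populations more cleanly scores higher on both metrics, so the same two functions could be applied to PCA and t-SNE coordinates of one dataset to compare them.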