Department of Computer and Communications Engineering, Kangwon National University, Chuncheon-si, Gangwon-do, 24341, South Korea.
Genes Genomics. 2020 Feb;42(2):225-234. doi: 10.1007/s13258-019-00896-6. Epub 2019 Dec 12.
One of the apparent characteristics of bioinformatics data is a very large number of features combined with a relatively small number of samples. The vast number of features makes an intuitive understanding of the target domain difficult. Dimensionality reduction, or manifold learning, has the potential to circumvent this obstacle, but in practice only a restricted set of methods has been preferred.
The objective of this study is to observe the characteristics of various dimensionality reduction methods, namely locally linear embedding (LLE), multi-dimensional scaling (MDS), principal component analysis (PCA), spectral embedding (SE), and t-distributed stochastic neighbor embedding (t-SNE), on the RNA-Seq dataset from the Genotype-Tissue Expression (GTEx) project.
The characteristics of the dimensionality reduction methods were observed on nine groups from three different tissues, in reduced spaces of two, three, and four dimensions. The visualizations show that each method produces a distinctly different reduced space. Quantitative results were obtained by measuring the performance of k-means clustering in each reduced space. Clustering in the spaces produced by non-linear methods such as LLE, t-SNE, and SE achieved better results than clustering in the spaces produced by linear methods such as PCA and MDS.
The experimental results suggest applying both linear and non-linear dimensionality reduction methods to the target data in order to gain an intuitive grasp of the underlying characteristics of the dataset.
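The kind of comparison described in this abstract can be sketched with scikit-learn. This is a minimal illustration, not the authors' pipeline: the synthetic data from `make_blobs` stands in for the GTEx RNA-Seq matrix, the helper name `evaluate`, the neighbor and perplexity settings, and the use of the adjusted Rand index as the clustering score are all assumptions, since the abstract does not state which performance measure or hyperparameters were used.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE, LocallyLinearEmbedding, SpectralEmbedding
from sklearn.metrics import adjusted_rand_score

N_GROUPS = 9  # mirrors the nine groups mentioned in the abstract

# Synthetic stand-in (assumption): few samples, many features.
X, y = make_blobs(n_samples=180, n_features=1000, centers=N_GROUPS,
                  cluster_std=5.0, random_state=0)


def evaluate(X, y, n_components):
    """Embed X with each method, cluster with k-means, and score against y."""
    reducers = {
        "PCA": PCA(n_components=n_components),
        "MDS": MDS(n_components=n_components, random_state=0),
        "LLE": LocallyLinearEmbedding(n_components=n_components,
                                      n_neighbors=10, random_state=0),
        "SE": SpectralEmbedding(n_components=n_components,
                                n_neighbors=10, random_state=0),
        # scikit-learn's default Barnes-Hut t-SNE only supports fewer than
        # four output dimensions, so the exact method is used throughout.
        "t-SNE": TSNE(n_components=n_components, method="exact",
                      perplexity=15, random_state=0),
    }
    scores = {}
    for name, reducer in reducers.items():
        embedding = reducer.fit_transform(X)
        labels = KMeans(n_clusters=N_GROUPS, n_init=10,
                        random_state=0).fit_predict(embedding)
        # Adjusted Rand index: 1.0 means perfect agreement with the groups.
        scores[name] = adjusted_rand_score(y, labels)
    return scores


scores = evaluate(X, y, n_components=2)
for name, score in scores.items():
    print(f"{name:6s} ARI = {score:.3f}")
```

Calling `evaluate(X, y, n_components=3)` or `evaluate(X, y, n_components=4)` repeats the comparison for the other reduced dimensionalities. Scores on synthetic blobs say nothing about how the methods rank on real expression data.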