Orlov Alexey A, Akhmetshin Tagir N, Horvath Dragos, Marcou Gilles, Varnek Alexandre
Laboratory of Chemoinformatics, UMR 7140 CNRS, University of Strasbourg, 4, Blaise Pascal Str., 67000, Strasbourg, France.
Mol Inform. 2025 Jan;44(1):e202400265. doi: 10.1002/minf.202400265. Epub 2024 Dec 5.
Dimensionality reduction is an important exploratory data analysis method that allows high-dimensional data to be represented in a human-interpretable lower-dimensional space. It is extensively applied in the analysis of chemical libraries, where chemical structure data, represented as high-dimensional feature vectors, are transformed into 2D or 3D chemical space maps. In this paper, four commonly used dimensionality reduction techniques, namely Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), Uniform Manifold Approximation and Projection (UMAP), and Generative Topographic Mapping (GTM), are evaluated in terms of neighborhood preservation and their ability to visualize sets of small molecules from the ChEMBL database.
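To make the notion of neighborhood preservation concrete, the sketch below projects feature vectors to 2D with PCA and measures the average fraction of each point's k nearest neighbors that is retained in the map. This is an illustration under stated assumptions, not the evaluation protocol of the paper: the abstract does not specify the metrics used, the function names (`pca_project`, `knn_indices`, `neighborhood_preservation`) are hypothetical, Euclidean distance is assumed, and random vectors stand in for molecular descriptors.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components (PCA via SVD)."""
    Xc = X - X.mean(axis=0)                      # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # coordinates in the low-dimensional map


def knn_indices(X, k):
    """Indices of the k nearest neighbors of every row (Euclidean, self excluded)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                       # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]


def neighborhood_preservation(X_high, X_low, k=10):
    """Mean fraction of each point's k high-dimensional neighbors kept in the map."""
    nn_high = knn_indices(X_high, k)
    nn_low = knn_indices(X_low, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(shared)) / k


# Assumption: random vectors used as a placeholder for real molecular descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
Y = pca_project(X, n_components=2)
score = neighborhood_preservation(X, Y, k=10)
print(f"fraction of preserved 10-nearest neighbors: {score:.3f}")
```

A score of 1 means every point keeps all of its k nearest neighbors in the map, while values near 0 mean the map scrambles local neighborhoods. The same function could be applied to coordinates produced by t-SNE, UMAP, or GTM implementations to compare methods on equal footing.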