Nonlinearity and Complexity Research Group, Aston University, Aston Triangle, Birmingham B4 7ET, United Kingdom.
J Chem Inf Model. 2011 Jul 25;51(7):1552-63. doi: 10.1021/ci1004042. Epub 2011 Jul 8.
A visualization plot of a molecular data set is a useful tool for gaining insight into a set of molecules. In chemoinformatics, most visualization plots are of molecular descriptors, and the statistical model most often used to produce a visualization is principal component analysis (PCA). This paper takes PCA, together with four other statistical models (NeuroScale, GTM, LTM, and LTM-LIN), and evaluates their ability to produce clustering in visualizations not of molecular descriptors but of molecular fingerprints. Two different tasks are addressed: understanding structural information (particularly combinatorial libraries) and relating structure to activity. The quality of the visualizations is compared both subjectively (by visual inspection) and objectively (with global distance comparisons and local k-nearest-neighbor predictors). On the data sets used to evaluate clustering by structure, LTM is found to perform significantly better than the other models. In particular, the clusters in LTM visualization space are consistent with the relationships between the core scaffolds that define the combinatorial sublibraries. On the data sets used to evaluate clustering by activity, LTM again gives the best performance but by a smaller margin. The results of this paper demonstrate the value of using both a nonlinear projection map and a Bernoulli noise model for modeling binary data.
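As a rough illustration of the objective comparison mentioned in the abstract, the sketch below projects binary fingerprints into two dimensions with PCA (the baseline model only) and scores the projection with a leave-one-out k-nearest-neighbor classifier. It is not the paper's implementation: the fingerprints are synthetic, the function names `pca_project` and `loo_knn_accuracy` are hypothetical, and the choice of k, the Euclidean metric, and the accuracy score are assumptions made for the example.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                      # center each bit column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # coordinates in the plot


def loo_knn_accuracy(Z, labels, k=5):
    """Leave-one-out k-NN accuracy measured in the visualization space Z."""
    # Pairwise Euclidean distances between projected points.
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # a point cannot vote for itself
    nn = np.argsort(d, axis=1)[:, :k]            # indices of the k nearest points
    votes = labels[nn]                           # neighbor labels, shape (n, k)
    pred = np.array([np.bincount(v).argmax() for v in votes])
    return float((pred == labels).mean())


# Synthetic stand-in for molecular fingerprints (an assumption, not real data):
# two random binary "scaffold" prototypes, each bit flipped with probability 0.1.
rng = np.random.default_rng(0)
n_bits, n_per_class = 64, 50
prototypes = rng.random((2, n_bits)) < 0.5
labels = np.repeat([0, 1], n_per_class)
flips = rng.random((2 * n_per_class, n_bits)) < 0.1
fingerprints = np.logical_xor(prototypes[labels], flips).astype(float)

Z = pca_project(fingerprints)                    # 2-D visualization coordinates
acc = loo_knn_accuracy(Z, labels, k=5)
print(f"leave-one-out 5-NN accuracy in PCA space: {acc:.2f}")
```

The same `loo_knn_accuracy` call could in principle be applied to coordinates produced by any of the other models, which is what makes a local predictor a convenient common yardstick for comparing visualizations.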