Gaspar Héléna A, Baskin Igor I, Marcou Gilles, Horvath Dragos, Varnek Alexandre
Laboratory of Chemoinformatics, University of Strasbourg, 67081 Strasbourg, France.
J Chem Inf Model. 2015 Jan 26;55(1):84-94. doi: 10.1021/ci500575y. Epub 2014 Dec 19.
This paper is devoted to the analysis and visualization in 2-dimensional space of large data sets of millions of compounds using the incremental version of generative topographic mapping (iGTM). The iGTM algorithm implemented in the in-house ISIDA-GTM program was applied to a database of more than 2 million compounds combining the data sets of 36 chemical suppliers and the NCI collection, encoded either by MOE descriptors or by MACCS keys. Taking advantage of the probabilistic nature of GTM, several approaches to data analysis were proposed. The chemical space coverage was evaluated using the normalized Shannon entropy. Different views of the data (property landscapes) were obtained by mapping various physical and chemical properties (molecular weight, aqueous solubility, LogP, etc.) onto the iGTM map. The superposition of these views helped to identify the regions of chemical space populated by compounds with desirable physicochemical profiles and the suppliers providing them. Data set similarity in the latent space was assessed by applying several metrics (Euclidean distance, Tanimoto and Bhattacharyya coefficients) to data probability distributions based on cumulated responsibility vectors. As a complementary approach, data sets were compared by considering them as individual objects on a meta-GTM map, built on cumulated responsibility vectors or property landscapes produced with iGTM. We believe that the iGTM methodology described in this article represents a fast and reliable way to analyze and visualize large chemical databases.
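The coverage and similarity measures named in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything below is an assumption: the entropy is normalized by the logarithm of the number of map nodes, the three metrics use their standard textbook definitions, and a "cumulated responsibility vector" is taken to be the per-node sum of compound responsibilities rescaled to sum to 1. The function names are hypothetical and are not part of ISIDA-GTM.

```python
import numpy as np


def cumulated_responsibilities(R):
    """Sum responsibilities (n_compounds x n_nodes) per node, rescaled to sum to 1.

    Assumption: this is how a data set is turned into a distribution over map nodes.
    """
    R = np.asarray(R, dtype=float)
    total = R.sum(axis=0)
    return total / total.sum()


def normalized_shannon_entropy(p):
    """Shannon entropy of p divided by log(n_nodes); 1 means uniform map coverage."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # 0 * log(0) is treated as 0
    return float(-(nz * np.log(nz)).sum() / np.log(p.size))


def euclidean_distance(p, q):
    """Euclidean distance between two node distributions."""
    return float(np.linalg.norm(np.asarray(p, float) - np.asarray(q, float)))


def tanimoto_coefficient(p, q):
    """Continuous Tanimoto coefficient: p.q / (p.p + q.q - p.q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    dot = p @ q
    return float(dot / (p @ p + q @ q - dot))


def bhattacharyya_coefficient(p, q):
    """Bhattacharyya coefficient: sum over nodes of sqrt(p_k * q_k)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sqrt(p * q).sum())
```

Under these assumptions, two suppliers whose compounds occupy the same map nodes in the same proportions would give a Tanimoto and a Bhattacharyya coefficient of 1 and a Euclidean distance of 0, while libraries confined to disjoint regions would give coefficients of 0.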