Chemical data visualization and analysis with incremental generative topographic mapping: big data challenge.

Authors

Gaspar Héléna A, Baskin Igor I, Marcou Gilles, Horvath Dragos, Varnek Alexandre

Affiliation

Laboratory of Chemoinformatics, University of Strasbourg, 67081 Strasbourg, France.

Publication information

J Chem Inf Model. 2015 Jan 26;55(1):84-94. doi: 10.1021/ci500575y. Epub 2014 Dec 19.

DOI:10.1021/ci500575y

PMID:25423612

Abstract

This paper is devoted to the analysis and visualization in 2-dimensional space of large data sets of millions of compounds using the incremental version of generative topographic mapping (iGTM). The iGTM algorithm, implemented in the in-house ISIDA-GTM program, was applied to a database of more than 2 million compounds combining the data sets of 36 chemical suppliers and the NCI collection, encoded either by MOE descriptors or by MACCS keys. Taking advantage of the probabilistic nature of GTM, several approaches to data analysis were proposed. The chemical space coverage was evaluated using the normalized Shannon entropy. Different views of the data (property landscapes) were obtained by mapping various physical and chemical properties (molecular weight, aqueous solubility, LogP, etc.) onto the iGTM map. The superposition of these views helped to identify the regions of chemical space populated by compounds with desirable physicochemical profiles and the suppliers providing them. The similarity of data sets in the latent space was assessed by applying several metrics (Euclidean distance, Tanimoto and Bhattacharyya coefficients) to data probability distributions based on cumulated responsibility vectors. As a complementary approach, data sets were compared by considering them as individual objects on a meta-GTM map, built on cumulated responsibility vectors or property landscapes produced with iGTM. We believe that the iGTM methodology described in this article represents a fast and reliable way to analyze and visualize large chemical databases.
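
The coverage and similarity measures named in the abstract can be illustrated with a minimal sketch. It is not the ISIDA-GTM implementation: the function names, the toy responsibility values, and the specific formulas (entropy normalized by the logarithm of the node count, the continuous form of the Tanimoto coefficient) are assumptions chosen for illustration, and the definitions used in the article may differ. Each compound is assumed to have a responsibility vector over the map nodes that sums to 1.

```python
import math

def cumulated_responsibilities(resp_rows):
    """Sum per-compound responsibility vectors, then normalize to a distribution."""
    totals = [0.0] * len(resp_rows[0])
    for row in resp_rows:
        for k, r in enumerate(row):
            totals[k] += r
    s = sum(totals)
    return [t / s for t in totals]

def normalized_entropy(p):
    """Shannon entropy divided by log(K); assumes K > 1 nodes.
    Near 0: data concentrated on one node. Near 1: uniform coverage."""
    h = -sum(x * math.log(x) for x in p if x > 0)
    return h / math.log(len(p))

def euclidean(p, q):
    """Euclidean distance between two node distributions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def tanimoto(p, q):
    """Continuous Tanimoto coefficient (assumed form): p.q / (|p|^2 + |q|^2 - p.q)."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (sum(a * a for a in p) + sum(b * b for b in q) - dot)

def bhattacharyya(p, q):
    """Bhattacharyya coefficient: sum over nodes of sqrt(p_k * q_k)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

# Hypothetical toy data: two small "libraries" on a 4-node map.
lib_a = [[0.7, 0.3, 0.0, 0.0], [0.6, 0.4, 0.0, 0.0]]
lib_b = [[0.25, 0.25, 0.25, 0.25]]

p_a = cumulated_responsibilities(lib_a)
p_b = cumulated_responsibilities(lib_b)

print("coverage A:", normalized_entropy(p_a))
print("coverage B:", normalized_entropy(p_b))
print("Euclidean:", euclidean(p_a, p_b))
print("Tanimoto:", tanimoto(p_a, p_b))
print("Bhattacharyya:", bhattacharyya(p_a, p_b))
```

In this toy case library B spreads evenly over all nodes and therefore scores higher coverage than library A, which occupies only two nodes. The two coefficients equal 1 for identical distributions, while the Euclidean distance equals 0.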
