Kayastha Shilva, Kunimoto Ryo, Horvath Dragos, Varnek Alexandre, Bajorath Jürgen
Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr. 2, 53113, Bonn, Germany.
Laboratoire de Chemoinformatique, UMR 7140, Université de Strasbourg, 1 rue Blaise Pascal, 67000, Strasbourg, France.
J Comput Aided Mol Des. 2017 Nov;31(11):961-977. doi: 10.1007/s10822-017-0070-1. Epub 2017 Oct 6.
The analysis of structure-activity relationships (SARs) becomes rather challenging when large and heterogeneous compound data sets are studied. In such cases, many different compounds and their activities need to be compared, which quickly goes beyond the capacity of subjective assessments. For a comprehensive large-scale exploration of SARs, computational analysis and visualization methods are required. Herein, we introduce a two-layered SAR visualization scheme specifically designed for increasingly large compound data sets. The approach combines a new compound pair-based variant of generative topographic mapping (GTM), a machine learning approach for nonlinear mapping, with chemical space networks (CSNs). The GTM component provides a global view of the activity landscapes of large compound data sets, in which informative local SAR environments are identified, augmented by a numerical SAR scoring scheme. Prioritized local SAR regions are then projected into CSNs that resolve these regions at the level of individual compounds and their relationships. Analysis of CSNs makes it possible to distinguish between regions having different SAR characteristics and select compound subsets that are rich in SAR information.
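As a minimal illustration of the second layer only, a chemical space network can be sketched as a graph whose nodes are compounds and whose edges connect pairs above a similarity threshold. The sketch below is an assumption-laden toy, not the authors' implementation: the Tanimoto measure on feature sets, the 0.5 threshold, the compound names, the potency values, and the helper names `tanimoto`, `build_csn`, and `potency_gaps` are all hypothetical. The abstract does not specify the similarity criterion or the SAR scoring scheme, so the potency gap per edge is only a stand-in for "SAR information".

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def build_csn(fingerprints, threshold=0.5):
    """Return CSN edges as {(cpd_i, cpd_j): similarity}.

    An edge is drawn when the pairwise similarity reaches the threshold.
    The threshold value is a hypothetical choice for this sketch.
    """
    edges = {}
    for i, j in combinations(sorted(fingerprints), 2):
        sim = tanimoto(fingerprints[i], fingerprints[j])
        if sim >= threshold:
            edges[(i, j)] = sim
    return edges


def potency_gaps(edges, potency):
    """Absolute potency difference across each edge.

    Large gaps between similar compounds hint at SAR discontinuity;
    this is a stand-in measure, not the scoring scheme of the paper.
    """
    return {(i, j): abs(potency[i] - potency[j]) for (i, j) in edges}


# Toy data: feature sets and potency values are invented for illustration.
fingerprints = {
    "cpd_a": {1, 2, 3, 4},
    "cpd_b": {1, 2, 3, 5},
    "cpd_c": {1, 2, 3, 4, 6},
    "cpd_d": {7, 8, 9},
}
potency = {"cpd_a": 6.0, "cpd_b": 8.5, "cpd_c": 6.5, "cpd_d": 5.0}

edges = build_csn(fingerprints, threshold=0.5)
gaps = potency_gaps(edges, potency)

for pair in sorted(gaps, key=gaps.get, reverse=True):
    print(pair, round(edges[pair], 2), gaps[pair])
```

In this toy set, `cpd_d` shares no features with the other compounds and stays an isolated node, while the three remaining compounds form a connected cluster whose edges can be ranked by potency gap.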