Kireeva Natalia V, Ovchinnikova Svetlana I, Tetko Igor V, Asiri Abdullah M, Balakin Konstantin V, Tsivadze Aslan Yu
Laboratory of New Physical-Chemical Problems, Frumkin Institute of Physical Chemistry & Electrochemistry, Russian Academy of Sciences, Leninsky pr-t 31, 119071 Moscow (Russia); Department of Molecular Physics, Moscow Institute of Physics & Technology, Institutsky per. 9, 141700 Dolgoprudny (Russia).
ChemMedChem. 2014 May;9(5):1047-59. doi: 10.1002/cmdc.201400027. Epub 2014 Apr 11.
Over the years, a number of dimensionality reduction techniques have been proposed and used in chemoinformatics to perform nonlinear mappings. In this study, four representative nonlinear dimensionality reduction methods from two different families were analyzed: distance-based approaches (Isomap and Diffusion Maps) and topology-based approaches (Generative Topographic Mapping (GTM) and Laplacian Eigenmaps). The methods were applied to the visualization of three toxicity datasets using four sets of descriptors. Two methods, GTM and Diffusion Maps, were identified as the best approaches; because they belong to different families, neither family could be prioritized over the other. The intrinsic dimensionality of the data was assessed by Maximum Likelihood Estimation. Descriptor sets with a higher intrinsic dimensionality were observed to yield maps of lower quality. A new statistical coefficient, which combines two previously known ones, was proposed to rank the maps automatically. Instead of relying on a single best method, we propose to automatically generate maps with different parameter values for different descriptor sets. By following this procedure, the maps with the highest values of the introduced coefficient can be selected automatically and used as a starting point for visual inspection by the user.
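The intrinsic dimensionality assessment mentioned in the abstract can be illustrated with a minimal sketch. It assumes the Levina-Bickel nearest-neighbour form of the maximum likelihood estimator, which the abstract does not name explicitly; the function name `mle_intrinsic_dimension`, the neighbour count `k=10`, and the synthetic test data are all hypothetical choices made for illustration and are not taken from the paper.

```python
import math
import random


def mle_intrinsic_dimension(points, k=10):
    """Nearest-neighbour maximum likelihood estimate of intrinsic dimensionality.

    Assumes the Levina-Bickel form of the estimator. `points` is a list of
    equal-length coordinate sequences; `k` is the number of nearest
    neighbours used per point (an illustrative default, not a value from
    the paper).
    """
    n = len(points)
    if not 2 <= k < n:
        raise ValueError("k must satisfy 2 <= k < number of points")
    local_estimates = []
    for i, p in enumerate(points):
        # Sorted Euclidean distances from point i to every other point.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        t_k = dists[k - 1]  # distance to the k-th nearest neighbour
        # Sum of log(T_k / T_j) over the k-1 closer neighbours;
        # zero distances (duplicate points) are skipped to avoid division by zero.
        log_sum = sum(math.log(t_k / t_j) for t_j in dists[: k - 1] if t_j > 0)
        if log_sum > 0:
            local_estimates.append((k - 1) / log_sum)
    if not local_estimates:
        raise ValueError("no usable local estimates (too many duplicate points?)")
    # The global estimate is the mean of the per-point estimates.
    return sum(local_estimates) / len(local_estimates)


# Synthetic check: a 2-D sheet embedded in a 5-D descriptor space.
random.seed(0)
sheet = [[random.random(), random.random(), 0.0, 0.0, 0.0] for _ in range(300)]
estimate = mle_intrinsic_dimension(sheet, k=10)
print(f"estimated intrinsic dimensionality: {estimate:.2f}")
```

On data like this the estimate should land near 2 rather than 5, which is the sense in which a descriptor set can have a lower intrinsic dimensionality than its nominal number of descriptors. The result depends on `k`, so in practice one would probably inspect a range of neighbour counts.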