Similarity searching of chemical databases using atom environment descriptors (MOLPRINT 2D): evaluation of performance.

Author information

Bender Andreas, Mussa Hamse Y, Glen Robert C, Reiling Stephan

Affiliation

Unilever Centre for Molecular Science Informatics, Chemistry Department, University of Cambridge, Cambridge CB2 1EW, United Kingdom.

Publication information

J Chem Inf Comput Sci. 2004 Sep-Oct;44(5):1708-18. doi: 10.1021/ci0498719.

PMID:15446830

Abstract

A molecular similarity searching technique based on atom environments, information-gain-based feature selection, and the naive Bayesian classifier has been applied to a series of diverse datasets and its performance compared to those of alternative searching methods. Atom environments are count vectors of heavy atoms present at a topological distance from each heavy atom of a molecular structure. In this application, using a recently published dataset of more than 100000 molecules from the MDL Drug Data Report database, the atom environment approach appears to outperform fusion of ranking scores as well as binary kernel discrimination, which are both used in combination with Unity fingerprints. Overall retrieval rates among the top 5% of the sorted library are nearly 10% better (more than 14% better in relative numbers) than those of the second best method, Unity fingerprints and binary kernel discrimination. In 10 out of 11 sets of active compounds the combination of atom environments and the naive Bayesian classifier appears to be the superior method, while in the remaining dataset, data fusion and binary kernel discrimination in combination with Unity fingerprints is the method of choice. Binary kernel discrimination in combination with Unity fingerprints generally comes second in performance overall. The difference in performance can largely be attributed to the different molecular descriptors used. Atom environments outperform Unity fingerprints by a large margin if the combination of these descriptors with the Tanimoto coefficient is compared. The naive Bayesian classifier in combination with information-gain-based feature selection and selection of a sensible number of features performs about as well as binary kernel discrimination in experiments where these classification methods are compared. 
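To make the descriptor definition concrete, the following is a minimal sketch of atom environments as per-atom count vectors, compared with the Tanimoto coefficient. It is an illustration under stated assumptions, not the MOLPRINT 2D implementation: the function names, the feature string format, the use of plain element symbols as atom types, and the depth limit of two bonds are all chosen here for the example.

```python
from collections import Counter, deque


def atom_environments(atom_types, bonds, max_depth=2):
    """Return one environment feature string per heavy atom.

    atom_types: one type label per heavy atom (element symbols here; an assumption).
    bonds: (i, j) index pairs connecting heavy atoms.
    max_depth: largest topological distance included (assumed value).
    """
    neighbours = {i: [] for i in range(len(atom_types))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    features = []
    for start in range(len(atom_types)):
        # Breadth-first search yields the topological distance from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            a = queue.popleft()
            if dist[a] == max_depth:
                continue
            for b in neighbours[a]:
                if b not in dist:
                    dist[b] = dist[a] + 1
                    queue.append(b)
        # Count how many atoms of each type sit at each distance.
        counts = Counter((d, atom_types[a]) for a, d in dist.items())
        parts = [f"{d}-{n}-{t}" for (d, t), n in sorted(counts.items())]
        features.append(";".join(parts))
    return features


def tanimoto(features_a, features_b):
    """Tanimoto coefficient on the sets of distinct environment features."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Heavy-atom graphs of ethanol (C-C-O) and 1-propanol (C-C-C-O).
ethanol = atom_environments(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = atom_environments(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
similarity = tanimoto(ethanol, propanol)
```

In this toy case only the environment of the hydroxyl oxygen is shared between the two molecules, so the similarity is low even though the structures look alike; a real application would use a richer atom typing scheme.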
When used on a monoamine oxidase dataset, atom environments and the naive Bayesian classifier perform as well as binary kernel discrimination in the case of a 50/50 split of training and test compounds. In the case of sparse training data, binary kernel discrimination is found to be superior on this particular dataset. On a third dataset, the atom environment descriptor shows higher retrieval rates than other 2D fingerprints tested here when used in combination with the Tanimoto similarity coefficient. Feature selection is shown to be a crucial step in determining the performance of the algorithm. The representation of molecules by atom environments is found to be more effective than Unity fingerprints for the type of biological receptor similarity calculations examined here. Combining information prior to scoring and including information about inactive compounds, as in the Bayesian classifier and binary kernel discrimination, is found to be superior to posterior data fusion (in the datasets tested here).
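The combination of information-gain-based feature selection and a naive Bayesian classifier described in the abstract can be sketched roughly as follows. This is a simplified illustration, not the authors' code: molecules are represented as sets of feature strings, the Laplace-style smoothing (+1 / +2) is an assumption, only features present in a molecule contribute to its score, and all function names and the toy data are hypothetical.

```python
import math


def entropy(pos, neg):
    """Shannon entropy (bits) of a two-class split with the given counts."""
    total = pos + neg
    h = 0.0
    for k in (pos, neg):
        if k:
            p = k / total
            h -= p * math.log2(p)
    return h


def information_gain(feature, actives, inactives):
    """Entropy reduction from splitting the training set on feature presence."""
    a_with = sum(feature in m for m in actives)
    i_with = sum(feature in m for m in inactives)
    a_without = len(actives) - a_with
    i_without = len(inactives) - i_with
    n = len(actives) + len(inactives)
    n_with = a_with + i_with
    return (entropy(len(actives), len(inactives))
            - n_with / n * entropy(a_with, i_with)
            - (n - n_with) / n * entropy(a_without, i_without))


def select_features(actives, inactives, k):
    """Keep the k features with the highest information gain."""
    vocabulary = set().union(*actives, *inactives)
    ranked = sorted(
        vocabulary,
        key=lambda f: (-information_gain(f, actives, inactives), f),
    )
    return ranked[:k]


def train(actives, inactives, k):
    """Return log-ratio weights log(P(f|active) / P(f|inactive)) per selected feature."""
    weights = {}
    for f in select_features(actives, inactives, k):
        # Smoothed conditional probabilities (smoothing scheme is an assumption).
        p_active = (sum(f in m for m in actives) + 1) / (len(actives) + 2)
        p_inactive = (sum(f in m for m in inactives) + 1) / (len(inactives) + 2)
        weights[f] = math.log(p_active / p_inactive)
    return weights


def score(molecule, weights):
    """Sum the weights of selected features present in the molecule."""
    return sum(w for f, w in weights.items() if f in molecule)


# Toy training data: feature "a" occurs only in actives.
actives = [{"a", "b"}, {"a", "c"}, {"a"}]
inactives = [{"b"}, {"c"}, {"b", "c"}]
weights = train(actives, inactives, k=2)
ranking = sorted([{"b"}, {"a", "x"}], key=lambda m: score(m, weights), reverse=True)
```

Ranking a library by this score places molecules carrying features enriched among the actives first. The number of selected features `k` is a free parameter in this sketch; the abstract notes that choosing a sensible number of features matters for performance.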

