Center for Bioinformatics Tübingen (ZBIT), University of Tübingen, Sand 1, 72076 Tübingen, Germany.
J Cheminform. 2010 Mar 11;2(1):2. doi: 10.1186/1758-2946-2-2.
The virtual screening of large compound databases is an important application of structure-activity relationship models. Because these data sets are structurally highly diverse, machine-learning-based QSAR models, which rely on a specific training set, cannot give reliable results for all compounds. It is therefore important to consider the subset of chemical space in which the model is applicable. The approaches to this problem published so far mostly use vectorial descriptor representations to define this domain of applicability of the model. Unfortunately, these cannot be extended easily to structured kernel-based machine learning models. For this reason, we propose three approaches to estimate the domain of applicability of a kernel-based QSAR model.
We evaluated three kernel-based applicability domain estimations using three different structured kernels on three virtual screening tasks. Each experiment consisted of training a kernel-based QSAR model using support vector regression and ranking a disjoint screening data set according to the predicted activity. For each prediction, the applicability of the model to the respective compound was described quantitatively by a score obtained from an applicability domain formulation. The suitability of each applicability domain estimation was evaluated by comparing the model performance on the subsets of the screening data sets obtained with different thresholds for the applicability scores. This comparison indicates that it is possible to separate the part of chemical space in which the model gives reliable predictions from the part consisting of structures too dissimilar to the training set for the model to be applied successfully. A closer inspection reveals that the virtual screening performance of the model improves considerably if the half of the molecules with the lowest applicability scores is omitted from the screening.
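The thresholding step described above can be illustrated with a minimal sketch. It is not one of the three formulations proposed in the paper, which the abstract does not specify. It assumes a hypothetical score, the mean of the k largest kernel values between a screening compound and the training set, and uses a Tanimoto kernel on feature sets as a stand-in for a structured kernel. All function names, the toy feature sets, and the choice of k are illustrative assumptions.

```python
from typing import Callable, FrozenSet, List, Sequence

Compound = FrozenSet[str]  # toy representation: a set of structural features


def tanimoto_kernel(a: Compound, b: Compound) -> float:
    """Tanimoto similarity on feature sets (stand-in for a structured kernel)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def applicability_score(
    x: Compound,
    training: Sequence[Compound],
    kernel: Callable[[Compound, Compound], float],
    k: int = 3,
) -> float:
    """Hypothetical score: mean of the k largest kernel values to the training set."""
    if not training:
        raise ValueError("training set must not be empty")
    top = sorted((kernel(x, t) for t in training), reverse=True)[:k]
    return sum(top) / len(top)


def keep_most_applicable(
    screening: Sequence[Compound],
    scores: Sequence[float],
    fraction: float = 0.5,
) -> List[Compound]:
    """Keep the given fraction of compounds with the highest scores, in input order."""
    n_keep = max(1, int(round(len(screening) * fraction)))
    ranked = sorted(range(len(screening)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [screening[i] for i in kept]


# Toy data (assumed for illustration only).
training = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b", "d"}),
    frozenset({"b", "c", "d"}),
]
screening = [
    frozenset({"a", "b", "c", "d"}),
    frozenset({"x", "y", "z"}),
    frozenset({"a", "b"}),
    frozenset({"a", "x", "y"}),
]

scores = [applicability_score(x, training, tanimoto_kernel, k=2) for x in screening]
reduced = keep_most_applicable(screening, scores, fraction=0.5)
for compound, score in zip(screening, scores):
    status = "kept" if compound in reduced else "omitted"
    print(sorted(compound), round(score, 3), status)
```

In a setting like the one described, the trained model would rank only the compounds in `reduced`, and its performance on that subset would be compared with its performance on the full screening set.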
The proposed applicability domain formulations for kernel-based QSAR models can successfully identify compounds for which no reliable predictions can be expected from the model. The resulting reduction of the search space and the elimination of some of the active compounds should not be considered a drawback, because the results indicate that, in most cases, the model would not have found these omitted ligands anyway.