Estimation of the applicability domain of kernel-based machine learning models for virtual screening.

Author information

Center for Bioinformatics Tübingen (ZBIT), University of Tübingen, Sand 1, 72076 Tübingen, Germany.

Publication information

J Cheminform. 2010 Mar 11;2(1):2. doi: 10.1186/1758-2946-2-2.

DOI:10.1186/1758-2946-2-2

PMID:20222949

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2851576/

Abstract

BACKGROUND

The virtual screening of large compound databases is an important application of structure-activity relationship models. Because these data sets are structurally highly diverse, machine-learning-based QSAR models, which rely on a specific training set, cannot give reliable results for all compounds. It is therefore important to consider the subset of chemical space in which the model is applicable. Most approaches to this problem published so far use vectorial descriptor representations to define this domain of applicability. Unfortunately, these cannot be extended easily to structured kernel-based machine learning models. For this reason, we propose three approaches to estimate the domain of applicability of a kernel-based QSAR model.

RESULTS

We evaluated three kernel-based applicability domain estimations using three different structured kernels on three virtual screening tasks. Each experiment consisted of training a kernel-based QSAR model using support vector regression and ranking a disjoint screening data set according to the predicted activity. For each prediction, the applicability of the model to the respective compound was described quantitatively by a score obtained from an applicability domain formulation. The suitability of each applicability domain estimation was evaluated by comparing the model performance on the subsets of the screening data sets obtained with different thresholds for the applicability scores. This comparison indicates that it is possible to separate the part of chemical space in which the model gives reliable predictions from the part consisting of structures too dissimilar to the training set for the model to be applied successfully. A closer inspection reveals that the virtual screening performance of the model improves considerably if the half of the molecules with the lowest applicability scores is omitted from the screening.
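The workflow described above (score each screening compound from kernel values, then screen only the compounds above a score threshold) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the abstract does not spell out the three formulations, so the score used here (mean kernel similarity to the k most similar training compounds) is a hypothetical stand-in, an RBF kernel on toy vectors replaces the structured molecule kernels, and the support vector regression step is omitted. All function names are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Stand-in kernel on toy vectors (assumption); a structured
    molecule kernel would take its place in a real setting."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)

def applicability_scores(K_screen_train, k=3):
    """Hypothetical score: mean kernel similarity of each screening
    compound to its k most similar training compounds."""
    top_k = np.sort(K_screen_train, axis=1)[:, -k:]
    return top_k.mean(axis=1)

def keep_most_applicable(scores, fraction=0.5):
    """Indices of the `fraction` of compounds with the highest scores."""
    n_keep = int(np.ceil(fraction * len(scores)))
    return np.argsort(scores)[::-1][:n_keep]

# Toy data: 5 screening compounds drawn like the training set,
# 5 drawn far away from it.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(20, 4))
near = rng.normal(0.0, 1.0, size=(5, 4))
far = rng.normal(6.0, 1.0, size=(5, 4))
screen = np.vstack([near, far])

scores = applicability_scores(rbf_kernel(screen, train), k=3)
kept = keep_most_applicable(scores, fraction=0.5)
print(sorted(kept.tolist()))
```

With `fraction=0.5`, the half of the screening set with the lowest scores is dropped before ranking, which mirrors the threshold comparison in the abstract; in this toy setup the retained compounds are the ones drawn from the same distribution as the training set.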

CONCLUSION

The proposed applicability domain formulations for kernel-based QSAR models can successfully identify compounds for which no reliable predictions can be expected from the model. The resulting reduction of the search space and the elimination of some active compounds should not be considered a drawback, because the results indicate that, in most cases, the model would not have found these omitted ligands anyway.

Similar articles

Estimation of the applicability domain of kernel-based machine learning models for virtual screening.

J Cheminform. 2010 Mar 11;2(1):2. doi: 10.1186/1758-2946-2-2.

Evaluation of QSAR Equations for Virtual Screening.

Int J Mol Sci. 2020 Oct 22;21(21):7828. doi: 10.3390/ijms21217828.

Molecule kernels: a descriptor- and alignment-free quantitative structure-activity relationship approach.

J Chem Inf Model. 2008 Sep;48(9):1868-81. doi: 10.1021/ci800144y. Epub 2008 Sep 4.

Rank order entropy: why one metric is not enough.

J Chem Inf Model. 2011 Sep 26;51(9):2302-19. doi: 10.1021/ci200170k. Epub 2011 Aug 29.

Assessment of machine learning reliability methods for quantifying the applicability domain of QSAR regression models.

J Chem Inf Model. 2014 Feb 24;54(2):431-41. doi: 10.1021/ci4006595. Epub 2014 Feb 11.

Kinase-kernel models: accurate in silico screening of 4 million compounds across the entire human kinome.

J Chem Inf Model. 2012 Jan 23;52(1):156-70. doi: 10.1021/ci200314j. Epub 2012 Jan 6.

J Chem Inf Model. 2019 Jan 28;59(1):181-189. doi: 10.1021/acs.jcim.8b00597. Epub 2018 Nov 19.

A Kernel-Based Method for Assessing Uncertainty on Individual QSAR Predictions.

Mol Inform. 2012 Oct;31(10):741-51. doi: 10.1002/minf.201200053. Epub 2012 Sep 25.

Predictive QSAR modeling workflow, model applicability domains, and virtual screening.

Curr Pharm Des. 2007;13(34):3494-504. doi: 10.2174/138161207782794257.

Enhancing Acute Oral Toxicity Predictions by using Consensus Modeling and Algebraic Form-Based 0D-to-2D Molecular Encodes.

Chem Res Toxicol. 2019 Jun 17;32(6):1178-1192. doi: 10.1021/acs.chemrestox.9b00011. Epub 2019 May 17.

Cited by

Machine Learning for Toxicity Prediction Using Chemical Structures: Pillars for Success in the Real World.

Chem Res Toxicol. 2025 May 19;38(5):759-807. doi: 10.1021/acs.chemrestox.5c00033. Epub 2025 May 2.

Reformulating Reactivity Design for Data-Efficient Machine Learning.

ACS Catal. 2023 Oct 6;13(20):13506-13515. doi: 10.1021/acscatal.3c02513. eCollection 2023 Oct 20.

Federated Learning in Computational Toxicology: An Industrial Perspective on the Effiris Hackathon.

Chem Res Toxicol. 2023 Sep 18;36(9):1503-1517. doi: 10.1021/acs.chemrestox.3c00137. Epub 2023 Aug 16.

Krein support vector machine classification of antimicrobial peptides.

Digit Discov. 2023 Feb 27;2(2):502-511. doi: 10.1039/d3dd00004d. eCollection 2023 Apr 11.

Computational Prediction of Compound-Protein Interactions for Orphan Targets Using CGBVS.

Molecules. 2021 Aug 24;26(17):5131. doi: 10.3390/molecules26175131.

Anti-Ebola: an initiative to predict Ebola virus inhibitors through machine learning.

Mol Divers. 2022 Jun;26(3):1635-1644. doi: 10.1007/s11030-021-10291-7. Epub 2021 Aug 6.

Comprehensive Analysis of Applicability Domains of QSPR Models for Chemical Reactions.

Int J Mol Sci. 2020 Aug 3;21(15):5542. doi: 10.3390/ijms21155542.

Enhanced ranking of PknB Inhibitors using data fusion methods.

J Cheminform. 2013 Jan 14;5(1):2. doi: 10.1186/1758-2946-5-2.

jCompoundMapper: An open source Java library and command-line tool for chemical fingerprints.

J Cheminform. 2011 Jan 10;3(1):3. doi: 10.1186/1758-2946-3-3.
