Using random forest to model the domain applicability of another random forest model.

Affiliation

Cheminformatics Department, Merck Research Laboratories, RY800-D133, Rahway, New Jersey 07065, United States.

Publication information

J Chem Inf Model. 2013 Nov 25;53(11):2837-50. doi: 10.1021/ci400482e. Epub 2013 Nov 5.

Abstract

In QSAR, a statistical model is generated from a training set of molecules (represented by chemical descriptors) and their biological activities. We will call this traditional type of QSAR model an "activity model". The activity model can be used to predict the activities of molecules not in the training set. A relatively new subfield for QSAR is domain applicability. The aim is to estimate the reliability of prediction of a specific molecule on a specific activity model. A number of different metrics have been proposed in the literature for this purpose. It is desirable to build a quantitative model of reliability against one or more of these metrics. We can call this an "error model". A previous publication from our laboratory (Sheridan J. Chem. Inf. Model., 2012, 52, 814-823.) suggested the simultaneous use of three metrics would be more discriminating than any one metric. An error model could be built in the form of a three-dimensional set of bins. When the number of metrics exceeds three, however, the bin paradigm is not practical. An obvious solution for constructing an error model using multiple metrics is to use a QSAR method, in our case random forest. In this paper we demonstrate the usefulness of this paradigm, specifically for determining whether a useful error model can be built and which metrics are most useful for a given problem. For the ten data sets and for the seven metrics we examine here, it appears that it is possible to construct a useful error model using only two metrics (TREE_SD and PREDICTED). These do not require calculating similarities/distances between the molecules being predicted and the molecules used to build the activity model, which can be rate-limiting.
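The bin paradigm described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and it is not the random forest error model the paper proposes; it only shows how a binned error model works and why the number of bins grows quickly with the number of metrics. The sketch assumes that TREE_SD is the standard deviation of the per-tree predictions of the activity model and that PREDICTED is the forest's averaged prediction; the abstract does not define either metric. All function names are hypothetical.

```python
import math
import statistics
from collections import defaultdict


def tree_sd_and_predicted(per_tree_predictions):
    """Return (TREE_SD, PREDICTED) for one molecule.

    Assumption: TREE_SD is the standard deviation across trees and
    PREDICTED is the mean across trees.
    """
    predicted = statistics.fmean(per_tree_predictions)
    tree_sd = statistics.pstdev(per_tree_predictions)
    return tree_sd, predicted


def bin_index(value, low, high, n_bins):
    """Map a value to an equal-width bin over [low, high], clamping outliers."""
    if high <= low:
        return 0
    idx = int((value - low) / (high - low) * n_bins)
    return min(max(idx, 0), n_bins - 1)


def bin_count(n_bins, n_metrics):
    """Number of cells in the grid: it grows exponentially with the metric count."""
    return n_bins ** n_metrics


def build_binned_error_model(metrics, errors, n_bins=3):
    """Build a binned error model.

    metrics: list of tuples, one tuple of metric values per molecule.
    errors:  list of (predicted - observed) activities for the same molecules.
    Returns a dict holding the metric ranges, the bin count per axis, and
    the RMSE observed in each occupied bin.
    """
    n_metrics = len(metrics[0])
    ranges = [
        (min(m[j] for m in metrics), max(m[j] for m in metrics))
        for j in range(n_metrics)
    ]
    squared = defaultdict(list)
    for m, e in zip(metrics, errors):
        key = tuple(bin_index(m[j], *ranges[j], n_bins) for j in range(n_metrics))
        squared[key].append(e * e)
    rmse = {k: math.sqrt(statistics.fmean(v)) for k, v in squared.items()}
    return {"ranges": ranges, "n_bins": n_bins, "rmse": rmse}


def estimate_error(model, metric_values):
    """Look up the expected RMSE for a new molecule; None if its bin is empty."""
    key = tuple(
        bin_index(v, *model["ranges"][j], model["n_bins"])
        for j, v in enumerate(metric_values)
    )
    return model["rmse"].get(key)


if __name__ == "__main__":
    # Toy, hand-made values (not data from the paper).
    toy_metrics = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.1)]
    toy_errors = [1.0, 3.0, 1.0]
    model = build_binned_error_model(toy_metrics, toy_errors, n_bins=2)
    print(estimate_error(model, (0.05, 0.05)))
    print(bin_count(3, 2), bin_count(3, 7))
```

With two metrics and three bins per axis the grid has 9 cells, but with seven metrics it has 2187, and most of them would be empty for a typical data set. This is one way to read the abstract's remark that the bin paradigm stops being practical beyond three metrics, which motivates fitting a regression model to the metrics instead.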


Similar articles

The Relative Importance of Domain Applicability Metrics for Estimating Prediction Errors in QSAR Varies with Training Set Diversity.

J Chem Inf Model. 2015 Jun 22;55(6):1098-107. doi: 10.1021/acs.jcim.5b00110. Epub 2015 Jun 4.

Three useful dimensions for domain applicability in QSAR models using random forest.

J Chem Inf Model. 2012 Mar 26;52(3):814-23. doi: 10.1021/ci300004n. Epub 2012 Mar 9.

General Approach to Estimate Error Bars for Quantitative Structure-Activity Relationship Predictions of Molecular Activity.

J Chem Inf Model. 2018 Aug 27;58(8):1561-1575. doi: 10.1021/acs.jcim.8b00114. Epub 2018 Jul 17.

Pre-processing feature selection for improved C&RT models for oral absorption.

J Chem Inf Model. 2013 Oct 28;53(10):2730-42. doi: 10.1021/ci400378j. Epub 2013 Oct 9.

Does rational selection of training and test sets improve the outcome of QSAR modeling?

J Chem Inf Model. 2012 Oct 22;52(10):2570-8. doi: 10.1021/ci300338w. Epub 2012 Oct 3.

Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection.

J Chem Inf Model. 2008 Sep;48(9):1733-46. doi: 10.1021/ci800151m. Epub 2008 Aug 26.

Rank order entropy: why one metric is not enough.

J Chem Inf Model. 2011 Sep 26;51(9):2302-19. doi: 10.1021/ci200170k. Epub 2011 Aug 29.

Contemporary QSAR classifiers compared.

J Chem Inf Model. 2007 Jan-Feb;47(1):219-27. doi: 10.1021/ci600332j.

QSAR model as a random event: A case of rat toxicity.

Bioorg Med Chem. 2015 Mar 15;23(6):1223-30. doi: 10.1016/j.bmc.2015.01.055. Epub 2015 Feb 7.
