Binary classification of a large collection of environmental chemicals from estrogen receptor assays by quantitative structure-activity relationship and machine learning methods.

Author information

Zang Qingda, Rotroff Daniel M, Judson Richard S

Affiliations

ORISE Postdoctoral Fellow and National Center for Computational Toxicology, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711, United States.

Publication information

J Chem Inf Model. 2013 Dec 23;53(12):3244-61. doi: 10.1021/ci400527b. Epub 2013 Dec 11.

DOI:10.1021/ci400527b

PMID:24279462

Abstract

There are thousands of environmental chemicals subject to regulatory decisions for endocrine disrupting potential. The ToxCast and Tox21 programs have tested ∼8200 chemicals in a broad screening panel of in vitro high-throughput screening (HTS) assays for estrogen receptor (ER) agonist and antagonist activity. The present work uses this large data set to develop in silico quantitative structure-activity relationship (QSAR) models using machine learning (ML) methods and a novel approach to manage the imbalanced data distribution. Training compounds from the ToxCast project were categorized as active or inactive (binding or nonbinding) classes based on a composite ER Interaction Score derived from a collection of 13 ER in vitro assays. A total of 1537 chemicals from ToxCast were used to derive and optimize the binary classification models while 5073 additional chemicals from the Tox21 project, evaluated in 2 of the 13 in vitro assays, were used to externally validate the model performance. In order to handle the imbalanced distribution of active and inactive chemicals, we developed a cluster-selection strategy to minimize information loss and increase predictive performance and compared this strategy to three currently popular techniques: cost-sensitive learning, oversampling of the minority class, and undersampling of the majority class. QSAR classification models were built to relate the molecular structures of chemicals to their ER activities using linear discriminant analysis (LDA), classification and regression trees (CART), and support vector machines (SVM) with 51 molecular descriptors from QikProp and 4328 bits of structural fingerprints as explanatory variables. A random forest (RF) feature selection method was employed to extract the structural features most relevant to the ER activity. 
The best model was obtained using SVM in combination with a subset of descriptors selected from the full set by the RF algorithm. It recognized active and inactive compounds with accuracies of 76.1% and 82.8%, respectively, giving a total accuracy of 81.6% on the internal test set and 70.8% on the external test set. These results demonstrate that combining high-quality experimental data with ML methods can yield robust models with excellent predictive accuracy, which are potentially useful for facilitating the virtual screening of chemicals for environmental risk assessment.
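
The three baseline techniques the abstract compares against (undersampling of the majority class, oversampling of the minority class, and cost-sensitive learning) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy 20/80 label split, and the inverse-frequency weighting formula are assumptions chosen for illustration, and the paper's cluster-selection strategy is not reproduced here because the abstract does not describe it in detail.

```python
import random
from collections import Counter


def _split_by_class(labels):
    """Return (minority_indices, majority_indices) for binary labels."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx


def undersample_majority(labels, rng):
    """Keep every minority item plus an equal-size random subset of the majority."""
    minority_idx, majority_idx = _split_by_class(labels)
    kept = rng.sample(majority_idx, len(minority_idx))
    return sorted(minority_idx + kept)


def oversample_minority(labels, rng):
    """Keep every item and redraw minority items (with replacement) until balanced."""
    minority_idx, majority_idx = _split_by_class(labels)
    n_extra = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    return sorted(list(range(len(labels))) + extra)


def balanced_class_weights(labels):
    """Inverse-frequency weights, n / (k * count), for cost-sensitive learning."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


# Toy labels: 1 = active, 0 = inactive. The 20/80 split is hypothetical,
# not the class ratio reported in the paper.
labels = [1] * 20 + [0] * 80
rng = random.Random(0)

under = undersample_majority(labels, rng)
over = oversample_minority(labels, rng)
weights = balanced_class_weights(labels)

print(Counter(labels[i] for i in under))
print(Counter(labels[i] for i in over))
print(weights)
```

The returned indices would be used to select rows of the descriptor matrix before training, while the weights would be passed to a classifier that supports per-class misclassification costs. Undersampling discards majority-class information and oversampling repeats minority examples, which is the trade-off the abstract's cluster-selection strategy is said to address.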

Similar articles

Identification of putative estrogen receptor-mediated endocrine disrupting chemicals using QSAR- and structure-based virtual screening approaches.

Toxicol Appl Pharmacol. 2013 Oct 1;272(1):67-76. doi: 10.1016/j.taap.2013.04.032. Epub 2013 May 23.

In silico screening of estrogen-like chemicals based on different nonlinear classification models.

J Mol Graph Model. 2007 Jul;26(1):135-44. doi: 10.1016/j.jmgm.2007.01.003. Epub 2007 Jan 17.

Predicting hepatotoxicity using ToxCast in vitro bioactivity and chemical structure.

Chem Res Toxicol. 2015 Apr 20;28(4):738-51. doi: 10.1021/tx500501h. Epub 2015 Mar 9.

In Silico Study of In Vitro GPCR Assays by QSAR Modeling.

Methods Mol Biol. 2016;1425:361-81. doi: 10.1007/978-1-4939-3609-0_16.

A ternary classification using machine learning methods of distinct estrogen receptor activities within a large collection of environmental chemicals.

Sci Total Environ. 2017 Feb 15;580:1268-1275. doi: 10.1016/j.scitotenv.2016.12.088. Epub 2016 Dec 20.

Classification and virtual screening of androgen receptor antagonists.

J Chem Inf Model. 2010 May 24;50(5):861-74. doi: 10.1021/ci100078u.

Ligand-based virtual screening and in silico design of new antimalarial compounds using nonstochastic and stochastic total and atom-type quadratic maps.

J Chem Inf Model. 2005 Jul-Aug;45(4):1082-100. doi: 10.1021/ci050085t.

Development and Validation of Decision Forest Model for Estrogen Receptor Binding Prediction of Chemicals Using Large Data Sets.

Chem Res Toxicol. 2015 Dec 21;28(12):2343-51. doi: 10.1021/acs.chemrestox.5b00358. Epub 2015 Nov 12.

A comprehensive support vector machine binary hERG classification model based on extensive but biased end point hERG data sets.

Chem Res Toxicol. 2011 Jun 20;24(6):934-49. doi: 10.1021/tx200099j. Epub 2011 May 6.

Cited by

MLinvitroTox reloaded for high-throughput hazard-based prioritization of high-resolution mass spectrometry data.

J Cheminform. 2025 Jan 31;17(1):14. doi: 10.1186/s13321-025-00950-4.

Machine Learning Methods for Endocrine Disrupting Potential Identification Based on Single-Cell Data.

Chem Eng Sci. 2023 Nov 5;281. doi: 10.1016/j.ces.2023.119086. Epub 2023 Jul 18.

Review of studies dedicated to the nuclear receptor family: Therapeutic prospects and toxicological concerns.

Front Endocrinol (Lausanne). 2022 Sep 13;13:986016. doi: 10.3389/fendo.2022.986016. eCollection 2022.

Machine Learning Models for Predicting Liver Toxicity.

Methods Mol Biol. 2022;2425:393-415. doi: 10.1007/978-1-0716-1960-5_15.

Structure-activity relationship-based chemical classification of highly imbalanced Tox21 datasets.

J Cheminform. 2020 Oct 27;12(1):66. doi: 10.1186/s13321-020-00468-x.

Channel Interactions and Robust Inference for Ratiometric β-lactamase Assay Data: a Tox21 Library Analysis.

ACS Sustain Chem Eng. 2018 Jan 15;6(3):3233-3241. doi: 10.1021/acssuschemeng.7b03394.

DeepSnap-Deep Learning Approach Predicts Progesterone Receptor Antagonist Activity With High Performance.

Front Bioeng Biotechnol. 2020 Jan 22;7:485. doi: 10.3389/fbioe.2019.00485. eCollection 2019.

Undersampling: case studies of flaviviral inhibitory activities.

J Comput Aided Mol Des. 2019 Nov;33(11):997-1008. doi: 10.1007/s10822-019-00255-3. Epub 2019 Nov 26.

G-Networks to Predict the Outcome of Sensing of Toxicity.

Sensors (Basel). 2018 Oct 16;18(10):3483. doi: 10.3390/s18103483.

Integrating docking scores and key interaction profiles to improve the accuracy of molecular docking: towards novel B-Raf inhibitors.

Medchemcomm. 2017 Jul 24;8(9):1835-1844. doi: 10.1039/c7md00229g. eCollection 2017 Sep 1.
