QSAR modeling of imbalanced high-throughput screening data in PubChem.

Authors

Zakharov Alexey V, Peach Megan L, Sitzmann Markus, Nicklaus Marc C

Affiliation

CADD Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, DHHS, NCI-Frederick, 376 Boyles St., Frederick, Maryland 21702, United States.

Publication information

J Chem Inf Model. 2014 Mar 24;54(3):705-12. doi: 10.1021/ci400737s. Epub 2014 Feb 28.

DOI:10.1021/ci400737s

PMID:24524735

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3985743/

Abstract

Many of the structures in PubChem are annotated with activities determined in high-throughput screening (HTS) assays. Because of the nature of these assays, the activity data are typically strongly imbalanced, with a small number of active compounds contrasting with a very large number of inactive compounds. We have used several such imbalanced PubChem HTS assays to test and develop strategies to efficiently build robust QSAR models from imbalanced data sets. Different descriptor types [Quantitative Neighborhoods of Atoms (QNA) and "biological" descriptors] were used to generate a variety of QSAR models in the program GUSAR. The models obtained were compared using external test and validation sets. We also report on our efforts to incorporate the most predictive of our models in the publicly available NCI/CADD Group Web services (http://cactus.nci.nih.gov/chemical/apps/cap).
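
The abstract does not say which balancing strategies were tested, so the following is only an illustrative sketch of one common way to handle such data: random undersampling of the inactive class, paired with balanced accuracy as a metric that is not dominated by the majority class. The function names, the `(compound_id, label)` record layout, and the toy data set of 5 actives among 1000 compounds are all assumptions made for this example; none of it is taken from the paper or from GUSAR.

```python
import random


def imbalance_ratio(labels):
    """Return the number of inactives per active for 0/1 activity labels."""
    actives = sum(1 for y in labels if y == 1)
    inactives = len(labels) - actives
    if actives == 0:
        raise ValueError("no active compounds in the data set")
    return inactives / actives


def undersample_inactives(records, ratio=1.0, seed=0):
    """Keep every active and a random subset of the inactives.

    records: list of (compound_id, label) pairs, label 1 = active, 0 = inactive
             (hypothetical layout chosen for this sketch).
    ratio:   desired number of inactives per active in the returned set.
    """
    actives = [r for r in records if r[1] == 1]
    inactives = [r for r in records if r[1] == 0]
    n_keep = min(len(inactives), round(ratio * len(actives)))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return actives + rng.sample(inactives, n_keep)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    pos = sum(1 for t, _ in pairs if t == 1)
    neg = sum(1 for t, _ in pairs if t == 0)
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Toy HTS-like data set: 5 actives among 1000 compounds (made-up numbers).
records = [(f"cmpd_{i}", 1 if i < 5 else 0) for i in range(1000)]
labels = [label for _, label in records]

print(imbalance_ratio(labels))                       # inactives per active
balanced = undersample_inactives(records, ratio=1.0)
print(len(balanced))                                 # size of the balanced training set

# A trivial "everything is inactive" predictor looks excellent on plain
# accuracy but only reaches chance level on balanced accuracy.
all_inactive = [0] * len(labels)
plain_accuracy = sum(1 for t, p in zip(labels, all_inactive) if t == p) / len(labels)
print(plain_accuracy, balanced_accuracy(labels, all_inactive))
```

In this toy case the all-inactive predictor is correct for 995 of 1000 compounds yet finds no actives, which is why a class-aware metric, evaluated on external sets that keep the original class proportions, is usually preferred when comparing models trained on rebalanced data.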
