Tetko Igor V, van Deursen Ruud, Godin Guillaume
Institute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), 85764, Neuherberg, Germany.
BIGCHEM GmbH, Valerystr. 49, 85716, Unterschleißheim, Germany.
J Cheminform. 2024 Dec 9;16(1):139. doi: 10.1186/s13321-024-00934-w.
Hyperparameter optimization is frequently employed in machine learning. However, optimizing a large parameter space can result in model overfitting. In recent studies on solubility prediction, the authors collected seven thermodynamic and kinetic solubility datasets from different data sources. They used state-of-the-art graph-based methods and compared models developed for each dataset using different data cleaning protocols and hyperparameter optimization. In our study we showed that hyperparameter optimization did not always result in better models, possibly due to overfitting, when the same statistical measures were used. Similar results could be obtained using pre-set hyperparameters, reducing the computational effort by around 10,000 times. We also extended the previous analysis by adding a representation learning method based on Natural Language Processing of SMILES, called Transformer CNN. We show that, across all analyzed sets using exactly the same protocol, Transformer CNN provided better results than graph-based methods for 26 out of 28 pairwise comparisons while using only a tiny fraction of the time required by the other methods. Last but not least, we stressed the importance of comparing calculation results using exactly the same statistical measures.

Scientific Contribution: We showed that models with optimized hyperparameters can suffer from overfitting and that using pre-set hyperparameters yields similar performance but is four orders of magnitude faster. Transformer CNN provided significantly higher accuracy than the other investigated methods.
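The sketch below is a minimal illustration, not the paper's code: it contrasts a model trained with pre-set hyperparameters against one tuned by cross-validated search, then scores both with exactly the same statistical measure (RMSE) on the same held-out set, the kind of like-for-like comparison the abstract argues for. The random-forest regressor, the parameter grid, and the synthetic data are illustrative assumptions; they stand in for the solubility datasets and the graph-based and Transformer CNN models of the study.

```python
# Hedged sketch: pre-set vs optimized hyperparameters, compared with
# the SAME measure (test RMSE). Model, grid, and data are assumptions,
# not the datasets or methods from the paper.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Pre-set hyperparameters: a single fit, no search.
preset = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Hyperparameter optimization: many fits over a parameter grid, which is
# where the extra computational effort (and the risk of overfitting) comes from.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 200], "max_depth": [None, 5, 10]},
    scoring="neg_root_mean_squared_error",
    cv=5,
).fit(X_tr, y_tr)

# Compare both models with exactly the same measure on exactly the same test set.
for name, model in [("pre-set", preset), ("optimized", search.best_estimator_)]:
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name:>9}: test RMSE = {rmse:.2f}")
```

If the two printed RMSE values are close, the search bought little beyond its cost, which mirrors the abstract's observation that pre-set hyperparameters can match optimized ones at a small fraction of the compute.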