Machine learning workflows beyond linear models in low-data regimes.

Authors

Dalmau David, Sigman Matthew S, Alegre-Requena Juan V

Affiliations

Departamento de Química Inorgánica, Instituto de Síntesis Química y Catálisis Homogénea (ISQCH), CSIC-Universidad de Zaragoza, C/Pedro Cerbuna 12, 50009 Zaragoza, Spain.

Department of Chemistry, University of Utah, 315 South 1400 East, Salt Lake City, Utah 84112, USA.

Publication information

Chem Sci. 2025 Apr 15;16(19):8555-8560. doi: 10.1039/d5sc00996k. eCollection 2025 May 14.

DOI:10.1039/d5sc00996k
PMID:40242845
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11997861/
Abstract

Data-driven methodologies are transforming chemical research by providing chemists with digital tools that accelerate discovery and promote sustainability. In this context, non-linear machine learning algorithms are among the most disruptive technologies in the field and have proven effective for handling large datasets. However, in data-limited scenarios, linear regression has traditionally prevailed due to its simplicity and robustness, while non-linear models have been met with skepticism over concerns related to interpretability and overfitting. In this study, we introduce ready-to-use, automated workflows designed to overcome these challenges. These frameworks mitigate overfitting through Bayesian hyperparameter optimization by incorporating an objective function that accounts for overfitting in both interpolation and extrapolation. Benchmarking on eight diverse chemical datasets, ranging from 18 to 44 data points, demonstrates that when properly tuned and regularized, non-linear models can perform on par with or outperform linear regression. Furthermore, interpretability assessments and predictions reveal that non-linear models capture underlying chemical relationships similarly to their linear counterparts. Ultimately, the automated non-linear workflows presented have the potential to become valuable tools in a chemist's toolbox for studying problems in low-data regimes alongside traditional linear models.
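
The idea of an objective function that penalizes overfitting in both interpolation and extrapolation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the k-nearest-neighbour regressor stands in for a generic non-linear model, interpolation error is estimated with a repeated 5-fold cross-validation, extrapolation error with a split sorted by target value, the two errors are simply summed, and an exhaustive search over `k` replaces Bayesian optimization for brevity. All function names are hypothetical.

```python
import math
import random


def knn_predict(train, x, k):
    """Predict y at x as the mean target of the k nearest training points (1-D inputs)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def rmse(train, test, k):
    """Root-mean-square error of a k-NN model fitted on train, evaluated on test."""
    errs = [(knn_predict(train, x, k) - y) ** 2 for x, y in test]
    return math.sqrt(sum(errs) / len(errs))


def interpolation_rmse(data, k, n_folds=5, repeats=10, seed=0):
    """Average RMSE over a repeated, shuffled k-fold cross-validation."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for f in range(n_folds):
            test = shuffled[f::n_folds]
            train = [p for i, p in enumerate(shuffled) if i % n_folds != f]
            scores.append(rmse(train, test, k))
    return sum(scores) / len(scores)


def extrapolation_rmse(data, k, n_folds=5):
    """Worst RMSE when the lowest or highest target values are held out."""
    ordered = sorted(data, key=lambda p: p[1])
    n_test = max(1, len(ordered) // n_folds)
    low = rmse(ordered[n_test:], ordered[:n_test], k)
    high = rmse(ordered[:-n_test], ordered[-n_test:], k)
    return max(low, high)


def combined_objective(data, k):
    """Assumed objective: interpolation error plus extrapolation error (equal weights)."""
    return interpolation_rmse(data, k) + extrapolation_rmse(data, k)


def make_data(n=30, seed=1):
    """Synthetic low-data set: noisy quadratic with n points (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 3) for _ in range(n)]
    return [(x, x ** 2 + rng.gauss(0, 0.3)) for x in xs]


data = make_data()
# Exhaustive search over one hyperparameter; a Bayesian optimizer would
# minimize the same objective with fewer evaluations on larger search spaces.
scores = {k: combined_objective(data, k) for k in range(1, 11)}
best_k = min(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Selecting the hyperparameter on the summed score, rather than on interpolation error alone, disfavours settings that fit the shuffled folds well but degrade at the edges of the target range, which is the failure mode of most concern with 18 to 44 data points.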

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/943c/12077359/f28a1e8a278f/d5sc00996k-f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/943c/12077359/7bf9601be431/d5sc00996k-f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/943c/12077359/a160da72daf4/d5sc00996k-f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/943c/12077359/34f345624cd7/d5sc00996k-f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/943c/12077359/54d4c99d001a/d5sc00996k-f5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/943c/12077359/16c0229cdadf/d5sc00996k-f6.jpg
