David Dalmau, Matthew S. Sigman, Juan V. Alegre-Requena
Departamento de Química Inorgánica, Instituto de Síntesis Química y Catálisis Homogénea (ISQCH), CSIC-Universidad de Zaragoza, C/Pedro Cerbuna 12, 50009 Zaragoza, Spain.
Department of Chemistry, University of Utah, 315 South 1400 East, Salt Lake City, Utah 84112, USA.
Chem Sci. 2025 Apr 15;16(19):8555-8560. doi: 10.1039/d5sc00996k. eCollection 2025 May 14.
Data-driven methodologies are transforming chemical research by providing chemists with digital tools that accelerate discovery and promote sustainability. In this context, non-linear machine learning algorithms are among the most disruptive technologies in the field and have proven effective for handling large datasets. However, in data-limited scenarios, linear regression has traditionally prevailed due to its simplicity and robustness, while non-linear models have been met with skepticism over concerns related to interpretability and overfitting. In this study, we introduce ready-to-use, automated workflows designed to overcome these challenges. These frameworks mitigate overfitting through Bayesian hyperparameter optimization by incorporating an objective function that accounts for overfitting in both interpolation and extrapolation. Benchmarking on eight diverse chemical datasets, ranging from 18 to 44 data points, demonstrates that when properly tuned and regularized, non-linear models can perform on par with or outperform linear regression. Furthermore, interpretability assessments and predictions reveal that non-linear models capture underlying chemical relationships similarly to their linear counterparts. Ultimately, the automated non-linear workflows presented have the potential to become valuable tools in a chemist's toolbox for studying problems in low-data regimes alongside traditional linear models.
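As a rough illustration of an objective function that penalizes overfitting in both interpolation and extrapolation, the sketch below averages a repeated k-fold cross-validation error with the error obtained when the highest-target points are held out. Everything in it is an assumption made for illustration: the equal weighting, the 20% hold-out, the toy k-nearest-neighbour model, the synthetic data, and all function names are hypothetical and are not taken from the paper. A plain grid search stands in for the Bayesian hyperparameter optimization described in the abstract.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x (1-D descriptor)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def score_split(xs, ys, train, test, k):
    """RMSE on `test` indices for a k-NN model built from `train` indices."""
    tx = [xs[i] for i in train]
    ty = [ys[i] for i in train]
    preds = [knn_predict(tx, ty, xs[i], k) for i in test]
    return rmse([ys[i] for i in test], preds)


def interpolation_rmse(xs, ys, k, n_folds=5, n_repeats=10, seed=0):
    """Mean RMSE over repeated, shuffled k-fold cross-validation."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        for fold in range(n_folds):
            test = idx[fold::n_folds]
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            scores.append(score_split(xs, ys, train, test, k))
    return sum(scores) / len(scores)


def extrapolation_rmse(xs, ys, k, holdout_fraction=0.2):
    """RMSE when the highest-target points are held out of training."""
    order = sorted(range(len(ys)), key=lambda i: ys[i])
    n_test = max(1, round(len(ys) * holdout_fraction))
    return score_split(xs, ys, order[:-n_test], order[-n_test:], k)


def combined_objective(xs, ys, k):
    """Equal-weight average of both errors (the weighting is an assumption)."""
    return 0.5 * (interpolation_rmse(xs, ys, k) + extrapolation_rmse(xs, ys, k))


# Synthetic 20-point dataset (illustrative only, not from the paper).
rng = random.Random(42)
xs = [i / 2 for i in range(20)]
ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]

# Grid search over the neighbour count; a Bayesian optimizer would replace this loop.
candidate_ks = [1, 2, 3, 5, 8]
best_k = min(candidate_ks, key=lambda k: combined_objective(xs, ys, k))
print("selected k:", best_k)
```

Because a nearest-neighbour model cannot predict beyond the range of its training targets, the extrapolation term is always positive here. A hyperparameter setting that looks good under shuffled cross-validation alone can therefore still be penalized.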