Häfner Dion, Gemmrich Johannes, Jochum Markus
Pasteur Labs, Brooklyn, NY 11205.
Niels Bohr Institute, University of Copenhagen, Copenhagen 2100, Denmark.
Proc Natl Acad Sci U S A. 2023 Nov 28;120(48):e2306275120. doi: 10.1073/pnas.2306275120. Epub 2023 Nov 20.
Big data and large-scale machine learning have had a profound impact on science and engineering, particularly in fields focused on forecasting and prediction. Yet, it is still not clear how we can use the superior pattern-matching abilities of machine learning models for scientific discovery. This is because the goals of machine learning and science are generally not aligned. In addition to being accurate, scientific theories must also be causally consistent with the underlying physical process and allow for human analysis, reasoning, and manipulation to advance the field. In this paper, we present a case study on discovering a symbolic model for oceanic rogue waves from data using causal analysis, deep learning, parsimony-guided model selection, and symbolic regression. We train an artificial neural network on causal features from an extensive dataset of observations from wave buoys, while selecting for predictive performance and causal invariance. We apply symbolic regression to distill this black-box model into a mathematical equation that retains the neural network's predictive capabilities, while allowing for interpretation in the context of existing wave theory. The resulting model reproduces known behavior, generates well-calibrated probabilities, and achieves better predictive scores on unseen data than current theory. This showcases how machine learning can facilitate inductive scientific discovery and paves the way for more accurate rogue wave forecasting.
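The distillation step described above, turning a black-box predictor into a compact equation while favoring parsimony, can be illustrated with a toy sketch. This is not the authors' pipeline. It swaps in an exhaustive search over a small, fixed library of candidate terms fitted by least squares, rather than a full symbolic regression engine. The `black_box` function is a hypothetical stand-in for a trained neural network, and the term library, the `distill` helper, and the `penalty` value are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def black_box(x):
    """Hypothetical stand-in for a trained neural network (assumption)."""
    return 2.0 * x[:, 0] ** 2 + 0.5 * x[:, 1]


# Illustrative library of candidate symbolic terms (assumption).
TERMS = [
    ("x0", lambda x: x[:, 0]),
    ("x1", lambda x: x[:, 1]),
    ("x0^2", lambda x: x[:, 0] ** 2),
    ("x1^2", lambda x: x[:, 1] ** 2),
    ("x0*x1", lambda x: x[:, 0] * x[:, 1]),
]


def distill(model, x, penalty=1e-3):
    """Search all term subsets; score = MSE against the model + penalty * n_terms."""
    y = model(x)  # fit to the black box's outputs, not to raw observations
    best = None
    for mask in range(1, 2 ** len(TERMS)):
        chosen = [t for i, t in enumerate(TERMS) if (mask >> i) & 1]
        design = np.column_stack([f(x) for _, f in chosen])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        mse = np.mean((design @ coef - y) ** 2)
        score = mse + penalty * len(chosen)  # parsimony-guided selection
        if best is None or score < best[0]:
            best = (score, [name for name, _ in chosen], coef)
    return best[1], best[2]


x = rng.uniform(-1.0, 1.0, size=(500, 2))
names, coef = distill(black_box, x)
print(" + ".join(f"{c:.2f}*{n}" for c, n in zip(coef, names)))
```

The complexity penalty is what keeps the search from returning the full library: supersets of the true terms fit equally well but score worse. A real application would replace the fixed library with a proper symbolic regression search and would check the result on held-out data.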