Schwalbe-Koda Daniel, Hamel Sebastien, Sadigh Babak, Zhou Fei, Lordi Vincenzo
Lawrence Livermore National Laboratory, Livermore, CA, 94550, USA.
Department of Materials Science and Engineering, University of California, Los Angeles, CA, 90095, USA.
Nat Commun. 2025 Apr 29;16(1):4014. doi: 10.1038/s41467-025-59232-0.
An accurate description of information is relevant for a range of problems in atomistic machine learning (ML), such as crafting training sets, performing uncertainty quantification (UQ), or extracting physical insights from large datasets. However, atomistic ML often relies on unsupervised learning or model predictions to analyze the information content of simulation or training data. Here, we introduce a theoretical framework that provides a rigorous, model-free tool to quantify information content in atomistic simulations. We demonstrate that the information entropy of a distribution of atom-centered environments explains known heuristics in ML potential development, from training set sizes to dataset optimality. Using this tool, we propose a model-free UQ method that reliably predicts epistemic uncertainty and detects out-of-distribution samples, including rare events in systems such as nucleation. This method provides a general tool for data-driven atomistic modeling and combines efforts in ML, simulations, and physical explainability.
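The abstract does not state which estimator is used, so the following is only a minimal sketch of how an information entropy over atom-centered environments, and a model-free novelty score derived from it, might be computed. It assumes each environment has already been encoded as a fixed-length descriptor vector and that the density is estimated with a Gaussian kernel. The function names `kernel_entropy` and `novelty`, the bandwidth `h`, and the descriptor arrays are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


def kernel_entropy(X, h=1.0):
    """Kernel-density entropy estimate of a set of descriptor vectors.

    Computes H = -(1/n) * sum_i log[(1/n) * sum_j K_h(x_i, x_j)] with an
    (assumed) Gaussian kernel K_h(x, y) = exp(-|x - y|^2 / (2 h^2)).
    """
    X = np.asarray(X, dtype=float)
    # Pairwise squared distances between all environments, shape (n, n).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq / (2.0 * h ** 2))
    return float(-np.mean(np.log(K.mean(axis=1))))


def novelty(y, X, h=1.0):
    """Score -log sum_i K_h(y, x_i) of a new environment y against a set X.

    Larger values mean y has little kernel overlap with X, which can be
    read as a hint that y is out of distribution.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    a = -((X - y) ** 2).sum(axis=-1) / (2.0 * h ** 2)
    m = a.max()
    # Log-sum-exp form avoids log(0) when y is far from every point in X.
    return float(-(m + np.log(np.exp(a - m).sum())))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(200, 8))  # stand-in descriptors, not real data
    print("entropy of set:", kernel_entropy(train))
    print("novelty, in-distribution:", novelty(train[0], train))
    print("novelty, far sample:", novelty(np.full(8, 10.0), train))
```

Under these assumptions, a set of identical environments has zero entropy, and a set of n environments with no kernel overlap has entropy log n, which is the sense in which such a quantity can track how much new information each added sample brings to a dataset.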