Fritz-Haber-Institute of the Max-Planck-Society, Faradayweg 4-6, 14195, Berlin, Germany.
Angew Chem Int Ed Engl. 2023 Jun 26;62(26):e202219170. doi: 10.1002/anie.202219170. Epub 2023 Apr 13.
Machine learning (ML) algorithms are currently emerging as powerful tools in all areas of science. Conventionally, ML is understood as a fundamentally data-driven endeavour. Unfortunately, large well-curated databases are scarce in chemistry. In this contribution, I therefore review science-driven ML approaches which do not rely on "big data", focusing on the atomistic modelling of materials and molecules. In this context, the term "science-driven" refers to approaches that begin with a scientific question and then ask what training data and model design choices are appropriate. As key features of science-driven ML, the automated and purpose-driven collection of data and the use of chemical and physical priors to achieve high data-efficiency are discussed. Furthermore, the importance of appropriate model evaluation and error estimation is emphasized.