Energy and Materials Division, Toyota Research Institute, Los Altos, USA.
Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, USA.
Sci Rep. 2022 Mar 18;12(1):4694. doi: 10.1038/s41598-022-08413-8.
Sequential learning for materials discovery is a paradigm in which a computational agent iteratively solicits new data and updates a model in service of exploration (finding the largest number of materials that meet some criteria) or exploitation (finding materials with an ideal figure of merit). In real-world discovery campaigns, acquiring new data can be costly, and an optimal strategy may involve using and acquiring data at different levels of fidelity, such as first-principles calculations that supplement experiments. In this work, we introduce agents that can operate on multiple data fidelities and benchmark their performance on an emulated discovery campaign to find materials with desired band gap values. The low-fidelity data are the results of density functional theory (DFT) calculations, and the high-fidelity data are experimental results. We demonstrate performance gains for agents that incorporate multi-fidelity data in two contexts: using a large body of low-fidelity data as a prior knowledge base, or acquiring low-fidelity data in tandem with experimental data. This advance provides a tool that enables materials scientists to test various acquisition and model hyperparameters to maximize the discovery rate of their own multi-fidelity sequential learning campaigns. It may also serve as a reference point for those interested in practical strategies for active or sequential learning campaigns when multiple data sources are available.
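To make the first context concrete (low-fidelity data used as a prior knowledge base), the following is a minimal toy sketch, not the agents or data used in this work. Everything in it is an assumption made for illustration: the synthetic candidate pool, the linear relation between the "DFT" and "experimental" gaps, the target window `TARGET`, and the function names `make_candidates`, `fit_line`, and `run_campaign`. The multi-fidelity agent fits a linear correction from DFT gaps to measured experimental gaps and acquires the candidate predicted to lie closest to the center of the target window; the baseline acquires at random.

```python
import random

TARGET = (1.0, 1.6)  # hypothetical target window for the experimental band gap (eV)


def make_candidates(n=200, seed=0):
    """Build a synthetic pool; both gap values are toy numbers, not real data."""
    rng = random.Random(seed)
    pool = []
    for i in range(n):
        x = rng.random()
        exp_gap = 3.0 * x  # toy high-fidelity ("experimental") value
        # Toy low-fidelity ("DFT") value: assumed systematic offset plus noise.
        dft_gap = 0.6 * exp_gap + rng.gauss(0.0, 0.05)
        pool.append({"id": i, "exp": exp_gap, "dft": dft_gap})
    return pool


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns the mean if xs are degenerate."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def run_campaign(pool, budget=30, n_seed=5, use_low_fidelity=True, seed=1):
    """Emulate a campaign: each iteration 'measures' one candidate's experimental gap.

    Returns (acquired indices, acquired indices whose experimental gap is in TARGET).
    """
    rng = random.Random(seed)
    order = list(range(len(pool)))
    rng.shuffle(order)
    measured = order[:n_seed]          # random seed experiments
    remaining = set(order[n_seed:])    # candidates not yet measured
    center = 0.5 * (TARGET[0] + TARGET[1])

    for _ in range(budget):
        if use_low_fidelity:
            # Refit the DFT -> experiment correction on everything measured so far.
            a, b = fit_line(
                [pool[i]["dft"] for i in measured],
                [pool[i]["exp"] for i in measured],
            )
            # Exploit: pick the candidate predicted closest to the window center.
            pick = min(sorted(remaining),
                       key=lambda i: abs(a * pool[i]["dft"] + b - center))
        else:
            # Baseline without any model: random acquisition.
            pick = rng.choice(sorted(remaining))
        remaining.remove(pick)
        measured.append(pick)

    acquired = measured[n_seed:]
    hits = [i for i in acquired if TARGET[0] <= pool[i]["exp"] <= TARGET[1]]
    return acquired, hits


if __name__ == "__main__":
    pool = make_candidates()
    for flag in (True, False):
        acquired, hits = run_campaign(pool, use_low_fidelity=flag)
        print(f"use_low_fidelity={flag}: {len(hits)} hits in {len(acquired)} acquisitions")
```

The number of hits per acquisition is one simple way to express a discovery rate. In this toy setup, changing `budget`, `n_seed`, or the acquisition rule plays the role of the acquisition and model hyperparameters mentioned above.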