Jose Ashna, Devijver Emilie, Jakse Noel, Poloni Roberta
SIMaP, Grenoble-INP, CNRS, University of Grenoble Alpes, Grenoble 38042, France.
LiG, Grenoble-INP, CNRS, University of Grenoble Alpes, Grenoble 38042, France.
J Am Chem Soc. 2024 Mar 6;146(9):6134-6144. doi: 10.1021/jacs.3c13687. Epub 2024 Feb 25.
In recent data-driven approaches to material discovery, scenarios where target quantities are expensive to compute and measure are often overlooked. In such cases, it becomes imperative to construct a training set that includes the most diverse, representative, and informative samples. Here, a novel regression tree-based active learning algorithm is employed for this purpose. It is applied to predict the band gap and adsorption properties of metal-organic frameworks (MOFs), a novel class of materials that results from the virtually infinite combinations of their building units. Simpler, low-dimensional descriptors, such as those based on stoichiometric and geometric properties, are used to compute the feature space for this model owing to their ability to better represent MOFs in the low-data regime. The partitions given by a regression tree constructed on the labeled part of the data set are used to select new samples to be added to the training set, thereby limiting its size while maximizing the prediction quality. Tests on the QMOF, hMOF, and dMOF data sets reveal that our method constructs small training data sets to learn regression models that predict the target properties more efficiently than existing active learning approaches, and with lower variance. Specifically, our active learning approach is highly beneficial when labels are unevenly distributed in the descriptor space and when the label distribution is imbalanced, which is often the case for real-world data. The regions defined by the tree help reveal patterns in the data, thereby offering a unique tool to efficiently analyze complex structure-property relationships in materials and accelerate materials discovery.
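The selection step described in the abstract (a regression tree fitted on the labeled samples, whose partitions guide which samples to label next) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `select_batch` and `active_learning_demo`, the synthetic two-feature data, and the scoring rule (sampling unlabeled points with probability proportional to the label variance of their leaf) are all assumptions made for illustration; the abstract does not specify the actual selection criterion. The sketch assumes NumPy and scikit-learn are available.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def select_batch(X, y_labeled, labeled_idx, batch_size, max_leaves=8, seed=0):
    """Pick unlabeled sample indices using the partitions of a regression tree.

    Hypothetical heuristic: unlabeled points are drawn with probability
    proportional to the label variance observed in their leaf.
    """
    rng = np.random.default_rng(seed)
    labeled_idx = np.asarray(labeled_idx)

    # Fit the tree on the labeled part only; its leaves partition feature space.
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves, random_state=seed)
    tree.fit(X[labeled_idx], y_labeled)

    leaf_of = tree.apply(X)  # leaf id of every sample, labeled or not
    is_labeled = np.zeros(len(X), dtype=bool)
    is_labeled[labeled_idx] = True
    unlabeled = np.flatnonzero(~is_labeled)

    # Weight each unlabeled sample by the label variance of its leaf.
    weights = np.empty(unlabeled.size)
    for leaf in np.unique(leaf_of[unlabeled]):
        y_leaf = y_labeled[leaf_of[labeled_idx] == leaf]
        # Small constant keeps every probability strictly positive.
        weights[leaf_of[unlabeled] == leaf] = np.var(y_leaf) + 1e-12
    weights /= weights.sum()

    size = min(batch_size, unlabeled.size)
    return rng.choice(unlabeled, size=size, replace=False, p=weights)


def active_learning_demo(n_rounds=5, batch_size=20, seed=0):
    """Run a few selection rounds on synthetic data (not MOF data)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, size=(500, 2))
    # Labels vary strongly in one region and weakly elsewhere (imbalanced).
    y = np.where(X[:, 0] > 0.5, 10 * X[:, 1], 0.1 * X[:, 1])

    labeled = rng.choice(len(X), size=20, replace=False)
    for r in range(n_rounds):
        # In practice, labeling the new samples is the expensive step.
        new = select_batch(X, y[labeled], labeled, batch_size, seed=seed + r)
        labeled = np.concatenate([labeled, new])
    return X, y, labeled
```

Calling `active_learning_demo()` returns the synthetic features, the labels, and the indices of the samples selected for labeling. With the variance-based weights assumed here, later rounds tend to draw more samples from leaves whose labels vary strongly, which mirrors the abstract's point about unevenly distributed labels.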