Hu Yang, Sandt Roland, Spatschek Robert
Institute of Energy Materials and Devices IMD-1, Forschungszentrum Jülich GmbH, 52428, Jülich, Germany.
Georesources and Materials Engineering, RWTH Aachen University, 52062, Aachen, Germany.
Sci Rep. 2024 Sep 3;14(1):20449. doi: 10.1038/s41598-024-71342-1.
Many potential applications of machine learning in chemistry and materials science are limited by small datasets, so the model must be designed with particular care to deliver reliable predictions. Feature selection, as the key determinant of dataset design, is therefore essential. We propose a practical and efficient feature filter strategy for identifying the best candidate input features. We illustrate the strategy on two tasks: predicting adsorption energies from a public dataset and predicting sublimation enthalpies from an in-house training dataset. For the adsorption energies, the strategy reduces the feature space from 12 dimensions to two while still delivering accurate results. For the sublimation enthalpies, it filters three input configurations, identified as the most relevant, out of 14 possible configurations of differing dimensionality for use in further predictions. The best extreme gradient boosting regression model performs well and is evaluated from both statistical and theoretical perspectives; it reaches an accuracy comparable to that of density functional theory computations and allows the predictions to be interpreted physically. Overall, the results indicate that the feature filter strategy can help interdisciplinary scientists who lack extensive AI expertise and have limited computational resources to first establish a reliable small training dataset, which may make the final machine learning model training easier and more accurate while avoiding time-consuming hyperparameter exploration and improper feature selection.
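The abstract does not state the criteria used by the feature filter, so the following is only a generic illustration of what a filter-type selection step can look like, not the authors' procedure. It assumes a simple scheme: rank candidate features by their absolute Pearson correlation with the target, then greedily keep the top ones while skipping any feature that is nearly collinear with one already kept. The function names, the redundancy threshold of 0.95, and the synthetic data are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (std_x * std_y)


def filter_features(columns, target, k=2, redundancy=0.95):
    """Keep up to k features (hypothetical filter, see assumptions above).

    Features are ranked by |r| with the target; a candidate is skipped
    if its |r| with any already kept feature reaches `redundancy`.
    """
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(
            abs(pearson(columns[name], columns[other])) < redundancy
            for other in kept
        ):
            kept.append(name)
        if len(kept) == k:
            break
    return kept


# Synthetic stand-in data: the target depends on f_a and f_b only.
rng = random.Random(0)
n = 200
f_a = [rng.gauss(0, 1) for _ in range(n)]
f_b = [rng.gauss(0, 1) for _ in range(n)]
f_a_copy = [2.0 * v + 1.0 for v in f_a]  # redundant rescaling of f_a
f_noise = [rng.gauss(0, 1) for _ in range(n)]
target = [3.0 * a - 2.0 * b + rng.gauss(0, 0.1) for a, b in zip(f_a, f_b)]

columns = {"f_a": f_a, "f_b": f_b, "f_a_copy": f_a_copy, "f_noise": f_noise}
selected = filter_features(columns, target, k=2)
print(selected)
```

A filter of this kind is cheap because it never trains the downstream model. That fits the stated goal of fixing a small, reliable input set before any regression model is trained, although the actual strategy in the paper may rely on different criteria.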