Takahara Wataru, Baba Ryuto, Harashima Yosuke, Takayama Tomoaki, Takasuka Shogo, Yamaguchi Yuichi, Kudo Akihiko, Fujii Mikiya
Division of Materials Science, Nara Institute of Science and Technology, Ikoma-shi, Nara-ken 630-0192, Japan.
Data Science Center, Nara Institute of Science and Technology, Ikoma-shi, Nara-ken 630-0192, Japan.
ACS Omega. 2025 Apr 10;10(15):14626-14639. doi: 10.1021/acsomega.4c06997. eCollection 2025 Apr 22.
In data-driven materials development, data sets are often imbalanced, with data points concentrated in certain regions, which makes it difficult to build regression models with machine learning methods. Photocatalysts are one example of inorganic functional materials that face this difficulty. Advanced data-driven approaches are therefore expected to help develop novel photocatalytic materials efficiently even when the data sets are imbalanced. We propose a two-stage machine learning model that handles imbalanced data sets without data thinning. In this study, we used two data sets that exhibit this imbalance: the Materials Project data set (openly shared because its data are in the public domain) and an in-house metal-sulfide photocatalyst data set (not openly shared because the experimental data are confidential). The two-stage model consists of two parts: a first-stage regression model, which predicts the target quantitatively, and a second-stage classification model, which determines the reliability of the values predicted by the regression model. Based on this two-stage model, we also propose a search scheme for variables related to the experimental conditions. The scheme is designed for photocatalyst exploration and takes experimental conditions into account, because the optimal set of variables for these conditions is unknown. The proposed two-stage model improves the prediction accuracy of the target compared with a one-stage model.
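The two-stage idea described in the abstract (a regression model followed by a classification model that judges whether each predicted value is reliable) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: both stages use a simple k-nearest-neighbour rule, a training point is labelled "reliable" when its leave-one-out regression error is within a tolerance `tol`, and the toy data set, the class name `TwoStageModel`, and the helper `knn_indices` are hypothetical. The abstract does not state which learners or which reliability criterion the authors used.

```python
from statistics import mean


def knn_indices(X, x, k, skip=None):
    """Indices of the k points in X closest to x (squared Euclidean distance)."""
    order = sorted(
        (i for i in range(len(X)) if i != skip),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)),
    )
    return order[:k]


class TwoStageModel:
    """Stage 1: k-NN regression. Stage 2: k-NN vote on whether stage 1 is reliable.

    Assumption for this sketch: a training point counts as reliable when its
    leave-one-out stage-1 prediction is within `tol` of the true target.
    """

    def __init__(self, k=3, tol=0.5):
        self.k, self.tol = k, tol

    def fit(self, X, y):
        self.X, self.y = X, y
        self.reliable = []
        for i, (x, target) in enumerate(zip(X, y)):
            # Predict point i from its neighbours, excluding the point itself.
            pred = mean(y[j] for j in knn_indices(X, x, self.k, skip=i))
            self.reliable.append(abs(pred - target) <= self.tol)
        return self

    def predict(self, x):
        idx = knn_indices(self.X, x, self.k)
        value = mean(self.y[j] for j in idx)          # stage 1: regression
        votes = sum(self.reliable[j] for j in idx)    # stage 2: classification
        return value, votes * 2 > len(idx)


# Toy imbalanced data: 50 points packed into [0, 1) with a smooth target,
# plus 4 scattered points in [5, 10) whose targets vary erratically.
X = [(i / 50,) for i in range(50)] + [(5.0,), (6.5,), (8.0,), (9.5,)]
y = [2 * x[0] for x in X[:50]] + [10.0, -4.0, 12.0, -6.0]

model = TwoStageModel(k=3, tol=0.5).fit(X, y)
dense_value, dense_ok = model.predict((0.5,))
sparse_value, sparse_ok = model.predict((7.5,))
print("dense region :", dense_value, dense_ok)
print("sparse region:", sparse_value, sparse_ok)
```

In this toy setting no data are discarded: the sparse points stay in the training set, and the second stage simply flags predictions made near them as unreliable, so that a downstream search could restrict itself to candidates whose predictions are flagged reliable.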