Korea Research Institute of Chemical Technology (KRICT), Republic of Korea.
Department of Chemistry, Gwangju Institute of Science and Technology (GIST), Republic of Korea.
Chem Commun (Camb). 2022 Jun 9;58(47):6729-6732. doi: 10.1039/d2cc01764d.
Data representation defines the feature space in which the data are distributed, and this distribution is one of the key factors determining the prediction accuracy of machine learning (ML). In particular, data representation is crucial for handling small and biased training datasets, which are the main challenge for ML in chemical applications. In this paper, we propose a data-agnostic representation method that automatically and universally generates a vector-shaped, target-specific representation of crystal structures. With the materials representation produced by the proposed method, the prediction capabilities of ML algorithms improved substantially on small training datasets and in transfer learning tasks. Moreover, the prediction accuracies of ML algorithms improved by 28.89-30.87% in extrapolation problems, in which the physical properties of materials from unknown material groups are predicted. The source code of EMRL is publicly available at https://github.com/ngs00/emrl/tree/master/EMRL.