Lee Sangjoon, Chen Clio, Garcia Griheydi, Oliynyk Anton
Department of Applied Physics and Applied Mathematics, Columbia University, New York, NY 10027, United States.
Department of Chemistry and Biochemistry, Manhattan College, Riverdale, NY 10471, United States.
Data Brief. 2024 Feb 9;53:110178. doi: 10.1016/j.dib.2024.110178. eCollection 2024 Apr.
Materials informatics employs data-driven approaches for the analysis and discovery of materials. Features, also referred to as descriptors, are essential in generating reliable and accurate machine-learning (ML) models. While general data can be obtained through public and commercial sources, features must be tailored to specific applications. Common featurizers suitable for generic chemical problems may not be effective for feature-property mapping in solid-state materials with ML models. Here, we have assembled the Oliynyk property list for compositional feature generation, which performs well on limited datasets (50 to 1000 training data points) in the solid-state materials domain. The dataset contains 98 elemental features for atomic numbers from 1 to 92, including thermodynamic properties, electronic structure data, size, electronegativity, and bulk properties such as melting point, density, and conductivity. The dataset has been used in peer-reviewed publications for material hardness prediction, classification, discovery of novel Heusler compounds, band gap prediction, and determination of the site preference of atoms, using ML models including support vector machines and random forests for classification problems and support vector regression for regression problems. We compiled the dataset by parsing data from publicly available databases and the literature, and supplemented it by interpolating missing values with Gaussian process regression.
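As a rough illustration of what compositional feature generation from an elemental property list can look like, the following minimal sketch turns a chemical formula into composition-weighted statistics. It is not the authors' implementation: the function names `parse_formula` and `featurize`, the dictionary layout, and the feature names are all hypothetical, and the table holds only two well-known properties (atomic number and Pauling electronegativity) for four elements rather than the 98 features of the actual dataset.

```python
import re
from collections import defaultdict

# Tiny illustrative property table (hypothetical layout, NOT the dataset's
# file format): atomic number and Pauling electronegativity for four elements.
ELEMENT_PROPERTIES = {
    "Na": {"atomic_number": 11, "electronegativity": 0.93},
    "Cl": {"atomic_number": 17, "electronegativity": 3.16},
    "Fe": {"atomic_number": 26, "electronegativity": 1.83},
    "O": {"atomic_number": 8, "electronegativity": 3.44},
}


def parse_formula(formula):
    """Parse a simple formula such as 'Fe2O3' into {element: amount}.

    Assumes no parentheses or charges; a missing amount counts as 1.
    """
    counts = defaultdict(float)
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[symbol] += float(amount) if amount else 1.0
    return dict(counts)


def featurize(formula, table=ELEMENT_PROPERTIES):
    """Return composition-based features for one formula.

    For every elemental property, compute the atomic-fraction-weighted
    mean plus the minimum and maximum over the constituent elements.
    """
    counts = parse_formula(formula)
    total = sum(counts.values())
    property_names = list(next(iter(table.values())).keys())

    features = {}
    for name in property_names:
        values = [table[element][name] for element in counts]
        fractions = [counts[element] / total for element in counts]
        features[name + "_weighted_mean"] = sum(
            fraction * value for fraction, value in zip(fractions, values)
        )
        features[name + "_min"] = min(values)
        features[name + "_max"] = max(values)
    return features


if __name__ == "__main__":
    for formula in ("NaCl", "Fe2O3"):
        print(formula, featurize(formula))
```

Each formula yields a fixed-length feature vector regardless of how many elements it contains, which is what allows models such as support vector machines or random forests to be trained on compositions alone.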