Department of Informatics, Blanchardstown Campus, Technological University Dublin, 15 YV78 Dublin, Ireland.
Department of Informatics, Postdoctoral Institute for Computational Studies, Enfield, NH 03748, USA.
Molecules. 2020 Dec 22;26(1):8. doi: 10.3390/molecules26010008.
In this study, we investigated quantitative relationships between the critical temperatures of superconducting inorganic materials and the basic physicochemical attributes of these materials (also called quantitative structure-property relationships). We show that one of the most recent studies ("A data-driven statistical model for predicting the critical temperature of a superconductor", published in Computational Materials Science by K. Hamidieh in 2018) reports models built on a dataset in which 27% of the entries are duplicates. We aimed to deliver stable models for a properly cleaned dataset using the same modeling techniques: multiple linear regression (MLR) and gradient-boosted decision trees (XGBoost). The predictive ability of our best XGBoost model (R² = 0.924, RMSE = 9.336 K in 10-fold cross-validation) is comparable to that of the XGBoost model by the author of the initial dataset (R² = 0.920, RMSE = 9.5 K in 10-fold cross-validation). At the same time, our best model is based on less sophisticated parameters, which permits more accurate interpretation while keeping the model generalizable. In particular, we found that the highest relative influence is attributed to variables that represent the thermal conductivity of the materials. In addition to MLR and XGBoost, we explored the potential of other machine learning techniques: neural networks (NN) and random forests (RF).
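The two ideas the abstract relies on, removing duplicate entries before modeling and scoring models by R² and RMSE over 10 folds, can be illustrated with a minimal standard-library sketch. It is not the authors' pipeline: the helper names (`drop_duplicates`, `duplicate_fraction`, `kfold_indices`) and the toy rows are hypothetical, the toy values are made up for illustration, and the fold splitter is a plain contiguous split that assumes the rows have already been shuffled.

```python
import math


def drop_duplicates(rows):
    """Return rows with exact repeats removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def duplicate_fraction(rows):
    """Share of rows that exactly repeat an earlier row."""
    return 1 - len(drop_duplicates(rows)) / len(rows)


def rmse(y_true, y_pred):
    """Root-mean-square error, in the units of the target (K for critical temperature)."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def kfold_indices(n, k=10):
    """Yield (train_idx, test_idx) for k contiguous folds (assumes pre-shuffled rows)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, test
        start = stop


# Hypothetical toy rows: (material label, critical temperature in K). Values are made up.
toy = [
    ("material_A", 39.0),
    ("material_B", 18.0),
    ("material_A", 39.0),  # exact repeat of the first row
    ("material_C", 92.0),
]
clean = drop_duplicates(toy)
print(len(toy), len(clean), duplicate_fraction(toy))
```

Deduplicating before the folds are formed matters because an entry that appears twice can otherwise land in both the training and the test part of a split, which tends to make cross-validated scores look better than they are.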