Yu Jiahao, Zhao Yongman, Pan Rongshun, Zhou Xue, Wei Zikai
School of Mechanical and Electrical Engineering, Shihezi University, Shihezi 832003, China.
Key Laboratory of Modern Agricultural Machinery, Shihezi University, Shihezi 832003, China.
ACS Omega. 2023 Jan 13;8(3):3078-3090. doi: 10.1021/acsomega.2c06324. eCollection 2023 Jan 24.
The critical temperature (Tc) of superconductors has long been a subject of interest. This study proposes a method that combines two-layer feature selection (TL) with an Optuna-Stacking ensemble learning model to predict Tc from physicochemical components. Most machine-learning models require a large amount of prior knowledge to manually construct the feature vectors associated with Tc, so those vectors may contain redundant or invalid features that adversely affect the analysis and prediction of Tc. The TL model combines the advantages of filter and wrapper feature selection. In the first layer, features are ranked by importance using SHapley Additive exPlanations (SHAP) in combination with CatBoost; the maximal information coefficient (MIC) and the distance correlation coefficient (DCC) are then applied, following that ranking, for initial feature selection. The second layer uses a cross-validation-based genetic algorithm (cv-GA) to eliminate the remaining redundant or invalid features. The selected features are fed into the Stacking ensemble learning model to predict Tc, and the multidimensional hyperparameters of the meta-model are optimized with Optuna, an improved Bayesian hyperparameter optimization framework based on the Tree-structured Parzen Estimator (TPE) and a pruning strategy. The model shows clear advantages and good generality in both prediction performance and feature reduction rate, and it also proves suitable for predicting the Tc of high-temperature superconductors. It provides an efficient and cost-effective method for data-driven superconductor research.
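To make the filter stage of the first layer more concrete, the following is a minimal sketch of a distance-correlation screen in plain Python. It uses the standard sample distance correlation (pairwise distances, double centering, normalized distance covariance); it is not the authors' implementation, and it leaves out the SHAP/CatBoost ranking, the MIC step, and the cv-GA layer. The function names, the toy features, and the 0.8 threshold are all hypothetical and chosen only for illustration.

```python
import math


def _centered_distances(values):
    """Pairwise |v_i - v_j| matrix with row, column and grand means removed."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_mean = [sum(row) / n for row in dist]
    grand_mean = sum(row_mean) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    a = _centered_distances(x)
    b = _centered_distances(y)

    def mean_product(p, q):
        return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)

    dcov2 = mean_product(a, b)
    denom = math.sqrt(mean_product(a, a) * mean_product(b, b))
    if denom == 0.0:
        return 0.0  # a constant sequence carries no dependence information
    return math.sqrt(max(dcov2, 0.0) / denom)


def select_features(features, target, threshold):
    """Keep features whose distance correlation with the target meets the threshold.

    `threshold` is a hypothetical cut-off; the source does not state one.
    """
    scores = {name: distance_correlation(col, target) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    kept = [name for name in ranked if scores[name] >= threshold]
    return kept, scores


# Toy data (hypothetical): the target depends on x only through x**2.
x = [i / 10 for i in range(-20, 21)]
target = [v * v for v in x]
features = {
    "x": x,                         # nonlinear, non-monotonic relation to the target
    "abs_x": [abs(v) for v in x],   # monotonic relation to the target
    "constant": [1.0] * len(x),     # uninformative feature
}
kept, scores = select_features(features, target, threshold=0.8)
```

Unlike a Pearson coefficient, the score for `x` is nonzero here even though its linear correlation with `x**2` vanishes on a symmetric grid, which is the usual reason for choosing a dependence measure of this kind in a filter step.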