Chen Junlin, He Yuli, Liang Yuexia, Wang Wenjia, Duan Xiong
College of Earth Sciences and Resources, China University of Geosciences (Beijing), Beijing, 100083, China.
School of Geographical Sciences, China West Normal University, Nanchong, 637009, Sichuan, China.
Sci Rep. 2024 Oct 5;14(1):23176. doi: 10.1038/s41598-024-74469-3.
The gross calorific value (GCV) of coal is an important parameter for evaluating coal quality, and it can be predicted with regression analysis. In this study, we propose a GCV prediction model based on cubist regression. Input variables were selected using correlation analysis and a recursive feature elimination algorithm, which yielded three sets of variables as the optimal combinations for the regression models: proximate analysis variables (Set 1: moisture, standard ash, and volatile matter), element analysis variables (Set 2: carbon, sulfur, and oxygen), and comprehensive index variables (Set 3: carbon, volatile matter, standard ash, sulfur, moisture, and hydrogen). The cubist model was compared with multiple linear regression, random forest regression, and several previously published prediction models, namely gradient boosting regression tree, support vector regression (SVR), backpropagation neural networks, and particle swarm optimization-artificial neural network (PSO-ANN). All seven regression models fit best on the comprehensive index variables among the three sets of input variables. The cubist model showed higher prediction accuracy and lower error than most other models: the R, mean absolute error, root mean square error, and average absolute relative deviation percentage values are 0.990, 0.476, 0.668, and 0.086% for the proximate analysis variables; 0.992, 0.381, 0.596, and 0.140% for the element analysis variables; and 0.999, 0.161, 0.219, and 0.087% for the comprehensive index variables, respectively. The cubist model combines the advantages of decision trees and linear regression, which makes it both accurate and highly interpretable, because its predictions come from multiple linear sub-equations. In addition, the cubist model runs noticeably faster than the alternatives, especially SVR and PSO-ANN, which require complex parameter optimization. In summary, the cubist model balances prediction accuracy, model interpretability, and computational efficiency, and provides a new and effective method for GCV prediction.
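The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. It rests on several assumptions: the function name `regression_metrics` and the sample values are hypothetical, "R" is taken to be the Pearson correlation coefficient (the abstract does not say whether it means R or its square), and the average absolute relative deviation percentage is taken in its common form, the mean of |observed - predicted| / |observed| multiplied by 100. This is not the authors' code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R, MAE, RMSE and AARD% for paired observed/predicted values.

    Assumptions (not confirmed by the abstract): R is the Pearson
    correlation coefficient, and AARD% is 100 * mean(|error| / |observed|).
    """
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean absolute error and root mean square error.
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # Average absolute relative deviation, as a percentage.
    # Observed values must be nonzero, which holds for a calorific value.
    aard_pct = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    # Pearson correlation between observed and predicted values.
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r = cov / math.sqrt(var_t * var_p)

    return {"r": r, "mae": mae, "rmse": rmse, "aard_pct": aard_pct}


if __name__ == "__main__":
    # Made-up illustrative values, not data from the study.
    observed = [20.0, 25.0, 30.0]
    predicted = [21.0, 24.0, 30.0]
    for name, value in regression_metrics(observed, predicted).items():
        print(f"{name}: {value:.4f}")
```

Because AARD% divides each error by the observed value, it is scale-free, whereas MAE and RMSE carry the units of GCV; that is why a set of variables can rank differently on AARD% than on the absolute error measures.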