Cihan Pınar, Alfarra Fatma, Kurtulus Ozcan H, Ciner Mirac Nur, Ongen Atakan
Corlu Engineering Faculty, Department of Computer Engineering, Tekirdag Namık Kemal University, 59860, Çorlu, Tekirdag, Turkey.
Engineering Faculty, Department of Environmental Engineering, Istanbul University-Cerrahpaşa, 34320, Avcilar, Istanbul, Turkey.
J Environ Manage. 2025 Sep;391:126521. doi: 10.1016/j.jenvman.2025.126521. Epub 2025 Jul 19.
Accurately predicting syngas composition is essential for optimizing energy production and ensuring environmental sustainability. Despite the growing use of machine learning techniques in this field, publicly available datasets remain limited, and existing datasets contain relatively few samples. To bridge this gap, we generated a comprehensive dataset of 3748 samples under controlled laboratory conditions and publicly shared it on Kaggle (https://www.kaggle.com/datasets/miracnurciner/gasification-dataset). This study aims to identify the most successful machine learning model for predicting H₂ and CH₄ gas concentrations by evaluating nine models: Random Forest (RF), Linear Regression (LR), Decision Tree (DT), Support Vector Regression (Linear and RBF), K-Nearest Neighbors (KNN), Gradient Boosting Regressor (GBR), XGBoost, CatBoost, and LightGBM. Model performance was assessed using multiple metrics, including the coefficient of determination (R²), root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and explained variance score (EVS). The Friedman test was applied to evaluate the statistical significance of performance differences among the models. The results show that the KNN model achieved the highest predictive performance for both H₂ (R² = 0.987, RMSE = 1.253) and CH₄ (R² = 0.979, RMSE = 0.920). The Friedman test shows that the performance differences among the models are statistically significant (p < 0.001). Applying Shapley Additive Explanations (SHAP) to the model clarifies the contribution of each feature to the predictions. The SHAP analysis identifies temperature and time as the main features affecting H₂ and CH₄ concentrations. This study highlights the potential of machine learning techniques for biomass gas prediction and advocates for integrating Explainable AI (XAI) methods, establishing a robust foundation for future research.
Furthermore, by providing a large, publicly available dataset, this research significantly advances studies in syngas composition prediction.
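For readers who want to reproduce the evaluation metrics named in the abstract, the following is a minimal sketch of how R², RMSE, MAE, MAPE, and EVS can be computed from observed and predicted concentrations. It is not the authors' code: the function name `regression_metrics` and the sample values are hypothetical, and the formulas are the standard textbook definitions, which the abstract does not spell out. The sketch assumes the observed values are non-zero (for MAPE) and not all identical (for R² and EVS).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, MAE, MAPE (in percent), and explained variance score.

    Hypothetical helper for illustration; assumes y_true has no zeros
    and is not constant.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    mean_res = sum(residuals) / n

    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    var_res = sum((r - mean_res) ** 2 for r in residuals) / n
    var_true = ss_tot / n

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "mape": 100.0 * sum(abs(r / t) for r, t in zip(residuals, y_true)) / n,
        "evs": 1.0 - var_res / var_true,
    }


# Made-up observed and predicted gas concentrations, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
metrics = regression_metrics(observed, predicted)
```

R² and EVS differ only in that EVS subtracts the mean residual before measuring spread, so the two coincide when a model's errors average to zero and diverge when its predictions are systematically biased.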