Ahmed Wesam, Wani Mudasir Ahmad, Plawiak Pawel, Meshoul Souham, Mahmoud Amena, Hammad Mohamed
Department of Information Technology, Faculty of Computers and Artificial Intelligence, Hurghada University, Hurghada, Egypt.
EIAS Data Science Lab, College of Computer and Information Sciences, Prince Sultan University, Riyadh, 11586, Saudi Arabia.
Sci Rep. 2025 Jul 24;15(1):26879. doi: 10.1038/s41598-025-12353-4.
Education is crucial for developing effective life skills and for allocating needed resources. Higher education institutions are adopting advanced technologies, such as artificial intelligence (AI), to enhance traditional teaching methods. Predicting academic performance has become increasingly important, as it can help improve university rankings and expand student opportunities. This study addresses challenges in performance analysis, quality education delivery, and student evaluation using machine learning (ML) models. Ten regression models are employed to predict academic outcomes, including K-Nearest Neighbors Regressor, Linear Regression, CatBoost, XGBoost, AdaBoost, and an ensemble voting regression (VR) algorithm that uses the top five heterogeneous regressors as base models. Two datasets with distinct feature sets and sizes were used to evaluate the generalizability of the models. The first dataset comprises 10,000 samples and six features focused on study behaviors, prior performance, and extracurricular activities. The second dataset includes 6,607 records and 20 features encompassing academic habits, demographic attributes, and institutional factors such as attendance, teacher quality, and parental involvement. Among the standalone ML models, linear regression performed best. The proposed ensemble VR model was then built as a weighted average of the base models, with weights based on their performance. Local interpretable model-agnostic explanations (LIME) and SHapley Additive exPlanations (SHAP) are then used to explain the predictions of the proposed ensemble VR model. For the first dataset, the VR model achieved an RMSE of 0.1050, an MAE of 0.0837, and an R² of 0.9890. On the second, more complex dataset, the VR model also performed best, with an R² of 0.7716 using the full feature set, highlighting its robustness and adaptability across diverse academic contexts.
These results offer actionable insights for educators, administrators, and policymakers to better understand student performance drivers and support data-informed educational strategies.
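As a rough illustration of the performance-weighted voting idea described in the abstract, the following minimal sketch averages the predictions of several base regressors using weights derived from their validation scores. The abstract does not state the exact weighting scheme, so weights proportional to validation R² are an assumption here, and every name and number below (`performance_weights`, `weighted_vote`, the toy predictions) is hypothetical rather than taken from the study.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def performance_weights(scores):
    """Turn per-model scores into weights that sum to 1.

    Assumption: weights are proportional to each model's validation R²,
    with negative scores clipped to zero. The paper may use another rule.
    """
    clipped = [max(s, 0.0) for s in scores]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(scores)] * len(scores)  # fall back to a plain average
    return [c / total for c in clipped]

def weighted_vote(predictions, weights):
    """Weighted average of per-model prediction lists, sample by sample."""
    n_samples = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions))
        for i in range(n_samples)
    ]

# Toy, made-up numbers: three hypothetical base models on a validation split.
y_val = [1.0, 2.0, 3.0, 4.0]
val_preds = [
    [1.1, 1.9, 3.2, 3.8],
    [0.5, 2.5, 2.5, 4.5],
    [1.5, 1.5, 3.5, 3.5],
]
weights = performance_weights([r2(y_val, p) for p in val_preds])

# Apply the validation-derived weights to the same models' test predictions.
y_test = [2.0, 3.0, 5.0]
test_preds = [
    [2.1, 2.8, 4.9],
    [1.6, 3.4, 5.5],
    [2.5, 3.5, 4.5],
]
ensemble = weighted_vote(test_preds, weights)
print("weights:", weights)
print("RMSE:", rmse(y_test, ensemble))
print("MAE:", mae(y_test, ensemble))
print("R2:", r2(y_test, ensemble))
```

Deriving the weights on a validation split and reporting the metrics on a separate test split keeps the weighting step from leaking into the evaluation. Whether the study follows this split protocol is not stated in the abstract.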