Suaza-Medina Mario, Peñabaena-Niebles Rita, Jubiz-Diaz Maria
Department of Informatics and Computer Science, Universidad de Zaragoza, Maria de Luna 1, Zaragoza, 50018, Spain.
Department of Industrial Engineering, Universidad del Norte, Km 5 Via Puerto Colombia, Barranquilla, 081007, Atlántico, Colombia.
Sci Rep. 2024 Oct 25;14(1):25306. doi: 10.1038/s41598-024-76596-3.
Data are becoming more important in education because they support the analysis and prediction of future behaviour, which can improve academic performance and quality at educational institutions. However, academic performance is affected by regional conditions, including demographic, psychographic, socioeconomic and behavioural variables, especially in lagging regions. This paper presents a methodology that applies nine classification algorithms and Shapley values to identify the variables that influence performance on the Colombian standardised test, the Saber 11 exam. The study is innovative because, unlike other studies, it focuses on lagging regions and combines educational data mining (EDM) with Shapley values to predict students' academic performance and to analyse the influence of each variable on that performance. The algorithms with the best accuracy are Extreme Gradient Boosting Machine, Light Gradient Boosting Machine, and Gradient Boosting Machine. According to the Shapley values, the most influential variables are the socioeconomic level index, gender, region, location of the educational institution, and age. For Colombia, the results show that male students over 18 years of age from urban educational institutions have the best academic performance. Moreover, educational quality differs among the lagging regions: students from Nariño have an advantage over those from other departments. The proposed methodology supports the design of public policies that are better aligned with the reality of lagging regions and that promote equity in access to education.
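To make the role of Shapley values in the abstract above more concrete, the following is a minimal sketch of an exact Shapley computation over a toy scoring function. It is not the authors' pipeline: the function `toy_value`, the feature names, and all numeric contributions are hypothetical and chosen only for illustration. In practice Shapley values for gradient boosting models are usually approximated with a dedicated library rather than enumerated exactly.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values for a set function value_fn over `features`.

    value_fn takes a frozenset of feature names and returns a number.
    The cost grows exponentially with the number of features.
    """
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # size of the coalition that f joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of f when added to coalition s
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


def toy_value(present):
    """Hypothetical model output when only the features in `present` are known."""
    score = 0.0
    if "socioeconomic_index" in present:
        score += 0.30
    if "gender" in present:
        score += 0.10
    if "urban_school" in present:
        score += 0.05
    # Hypothetical interaction between two features
    if "socioeconomic_index" in present and "urban_school" in present:
        score += 0.04
    return score


features = ["socioeconomic_index", "gender", "urban_school"]
phi = shapley_values(toy_value, features)

# Rank features by absolute attribution, largest first
ranking = sorted(phi, key=lambda name: abs(phi[name]), reverse=True)
```

In this sketch the interaction term is split equally between the two features involved, and the attributions sum to the difference between the score with all features and the score with none, which is the efficiency property that makes Shapley values suitable for ranking variable influence.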