Ma Xiang, Pan Wei, Yu Xiao-Ning
The Affiliated Yantai Stomatological Hospital of Binzhou Medical University, Yantai, 264000, China.
Department of Periodontology, The Affiliated Yantai Stomatological Hospital of Binzhou Medical University, 148 North Road, Yantai, China.
BMC Med Educ. 2025 Jul 23;25(1):1099. doi: 10.1186/s12909-025-07706-6.
This study systematically evaluates the performance of artificial intelligence (AI)-generated examinations in periodontology education, comparing their quality, student outcomes, and practical applications with those of human-designed examinations.
A randomized controlled trial was conducted with 126 undergraduate dental students, who were randomly assigned to either the AI-generated examination group (n = 63) or the human-designed examination group (n = 63). The AI-generated examination was developed using GPT-4, while the human examination was derived from the 2024 institutional final exam. Both assessments covered identical content from Periodontology (5th Edition) and comprised 90 multiple-choice questions (MCQs) in five formats: A1, single-sentence best choice; A2, case-summary best choice; A3, case-group best choice; A4, case-chain best choice; and X, multiple correct options. Psychometric properties (reliability, validity, difficulty, and discrimination) and student feedback were analyzed using split-half reliability, content coverage analysis, factor analysis, and 5-point Likert scales.
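The item-analysis indices named above follow standard classical test theory definitions. The sketch below is not the authors' code: the array names, the odd-even split, and the simulated response matrix are illustrative assumptions. It shows how split-half reliability with Spearman-Brown correction, item difficulty, and an upper-versus-lower-group discrimination index can be computed from a binary scored response matrix.

```python
# Minimal sketch of classical item-analysis indices (illustrative only).
# Assumes `scores` is a binary matrix of shape (students, items): 1 = correct, 0 = incorrect.
import numpy as np

def split_half_reliability(scores: np.ndarray) -> float:
    """Odd-even split-half reliability with Spearman-Brown correction."""
    odd = scores[:, 0::2].sum(axis=1)   # half-test totals on odd-numbered items
    even = scores[:, 1::2].sum(axis=1)  # half-test totals on even-numbered items
    r = np.corrcoef(odd, even)[0, 1]    # correlation between the two halves
    return 2 * r / (1 + r)              # Spearman-Brown prophecy formula

def item_difficulty(scores: np.ndarray) -> np.ndarray:
    """Difficulty index P: proportion of students answering each item correctly."""
    return scores.mean(axis=0)

def item_discrimination(scores: np.ndarray, frac: float = 0.27) -> np.ndarray:
    """Discrimination index D: P(upper group) - P(lower group), grouped by total score."""
    n = scores.shape[0]
    k = max(1, int(round(frac * n)))
    order = np.argsort(scores.sum(axis=1))      # rank students by total score
    lower, upper = scores[order[:k]], scores[order[-k:]]
    return upper.mean(axis=0) - lower.mean(axis=0)

# Simulated data only (63 students, 90 items, as in the trial), NOT the study dataset.
rng = np.random.default_rng(0)
scores = (rng.random((63, 90)) < 0.7).astype(int)
print(round(split_half_reliability(scores), 2))
print(item_difficulty(scores)[:5], item_discrimination(scores)[:5])
```

The 27% upper/lower grouping fraction is the conventional choice for the discrimination index; other cut-offs can be used without changing the rest of the computation.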
The AI examination demonstrated superior content coverage (81.3% vs. 72.4%) and significantly higher total scores (79.34 ± 6.93 vs. 73.17 ± 9.57, p = 0.027). However, its overall discrimination index was significantly lower (0.35 vs. 0.49, p = 0.004). Both examinations showed adequate split-half reliability (AI = 0.81, human = 0.84) and comparable difficulty distributions (AI: easy 40.0%, moderate 46.7%, difficult 13.3%; human: easy 30.0%, moderate 50.0%, difficult 20.0%; p = 0.274). Students rated the AI test significantly lower for perceived difficulty appropriateness (3.53 ± 1.03 vs. 4.19 ± 0.76, p < 0.001), knowledge coverage (3.67 ± 0.89 vs. 4.19 ± 0.72, p < 0.001), and learning inspiration (3.79 ± 0.90 vs. 4.25 ± 0.67, p = 0.001).
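The abstract does not specify which statistical tests produced the p-values above, so the sketch below shows two conventional choices as an assumption: Welch's t-test on group total scores and a chi-square test on difficulty-band item counts. The score data are simulated stand-ins, and the band counts are inferred from the reported percentages of 90 items; none of these values come from the study dataset.

```python
# Rough sketch of group comparisons of the kind reported above (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
ai_totals = rng.normal(79.3, 6.9, 63)     # simulated per-student totals, AI group
human_totals = rng.normal(73.2, 9.6, 63)  # simulated per-student totals, human group

# Independent-samples t-test on total scores (Welch's variant shown).
t_stat, p_scores = stats.ttest_ind(ai_totals, human_totals, equal_var=False)

# Chi-square test on easy/moderate/difficult item counts;
# counts inferred from the reported percentages of 90 items.
table = np.array([[36, 42, 12],    # AI exam
                  [27, 45, 18]])   # human exam
chi2, p_bands, dof, expected = stats.chi2_contingency(table)

print(f"total scores: t = {t_stat:.2f}, p = {p_scores:.3f}")
print(f"difficulty bands: chi2 = {chi2:.2f}, p = {p_bands:.3f}")
```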
While AI-generated examinations improve content breadth and efficiency, their limited clinical contextualization and discrimination constrain their use in high-stakes applications. A hybrid "AI-human collaborative generation" framework, integrating medical knowledge graphs for contextual optimization, is proposed to balance automation with assessment precision. This study provides empirical evidence for the role of AI in enhancing dental education assessment systems.