Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions.

Publication Information

Acad Med. 2024 May 1;99(5):508-512. doi: 10.1097/ACM.0000000000005626. Epub 2023 Dec 28.


DOI: 10.1097/ACM.0000000000005626
PMID: 38166323
Abstract

PROBLEM: Creating medical exam questions is time consuming, but well-written questions can be used for test-enhanced learning, which has been shown to have a positive effect on student learning. The automated generation of high-quality questions using large language models (LLMs), such as ChatGPT, would therefore be desirable. However, there are no current studies that compare students' performance on LLM-generated questions to questions developed by humans.

APPROACH: The authors compared student performance on questions generated by ChatGPT (LLM questions) with questions created by medical educators (human questions). Two sets of 25 multiple-choice questions (MCQs) were created, each with 5 answer options, 1 of which was correct. The first set of questions was written by an experienced medical educator, and the second set was created by ChatGPT 3.5 after the authors identified learning objectives and extracted some specifications from the human questions. Students answered all questions in random order in a formative paper-and-pencil test that was offered leading up to the final summative neurophysiology exam (summer 2023). For each question, students also indicated whether they thought it had been written by a human or ChatGPT.

OUTCOMES: The final data set consisted of 161 participants and 46 MCQs (25 human and 21 LLM questions). There was no statistically significant difference in item difficulty between the 2 question sets, but discriminatory power was statistically significantly higher in human than LLM questions (mean = .36, standard deviation [SD] = .09 vs mean = .24, SD = .14; P = .001). On average, students identified 57% of question sources (human or LLM) correctly.

NEXT STEPS: Future research should replicate the study procedure in other contexts (e.g., other medical subjects, semesters, countries, and languages). In addition, the question of whether LLMs are suitable for generating different question types, such as key feature questions, should be investigated.
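The APPROACH step of prompting an LLM with a learning objective and format specifications can be scripted. The authors worked with ChatGPT 3.5 interactively, so the following is only a minimal sketch of the same idea via the OpenAI Python SDK; the prompt wording and the `gpt-3.5-turbo` model name are assumptions, not the study's protocol.

```python
# Sketch: generating one exam MCQ from a learning objective via the
# OpenAI Python SDK. Prompt template and model choice are assumptions;
# the study's authors used ChatGPT 3.5 directly, not the API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_mcq(learning_objective: str) -> str:
    """Request one MCQ with 5 answer options and exactly 1 correct
    answer, matching the question format described in the abstract."""
    prompt = (
        "Write one multiple-choice exam question for medical students "
        f"on this learning objective: {learning_objective}. "
        "Give exactly 5 answer options labeled A-E, exactly 1 of which "
        "is correct, and indicate the correct option."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_mcq("the resting membrane potential of neurons"))
```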

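On the OUTCOMES statistics: item difficulty is conventionally the proportion of students who answer an item correctly, and discriminatory power is commonly the corrected point-biserial correlation between an item score and the rest of the test score. The abstract does not specify which variants the authors used, so the following is a minimal sketch under those assumptions.

```python
# Sketch: classical test theory item statistics for a 0/1 response
# matrix of shape (n_students, n_items). The exact difficulty and
# discrimination formulas used in the study are assumptions here.
import numpy as np

def item_statistics(responses):
    r = np.asarray(responses, dtype=float)
    # Item difficulty: proportion of students answering correctly.
    difficulty = r.mean(axis=0)
    # Discriminatory power: corrected point-biserial correlation of
    # each item with the rest-score (total score minus that item),
    # so an item is not correlated with its own contribution.
    rest = r.sum(axis=1, keepdims=True) - r
    discrimination = np.array(
        [np.corrcoef(r[:, j], rest[:, j])[0, 1] for j in range(r.shape[1])]
    )
    return difficulty, discrimination

# Demo with simulated data matching the study's dimensions
# (161 students, 46 MCQs); a real analysis would use actual answers.
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(161, 46))
difficulty, discrimination = item_statistics(responses)
```

The reported P = .001 for the difference in mean discrimination between the 25 human and 21 LLM items could then come from an independent-samples test over the per-item values (e.g., scipy.stats.ttest_ind); the abstract does not name the test that was used.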

Similar Articles

[1]
Appraisal of ChatGPT's Aptitude for Medical Education: Comparative Analysis With Third-Year Medical Students in a Pulmonology Examination.

JMIR Med Educ. 2024-7-23

[2]
Integrating ChatGPT in Orthopedic Education for Medical Undergraduates: Randomized Controlled Trial.

J Med Internet Res. 2024-8-20

[3]
A novel student-led approach to multiple-choice question generation and online database creation, with targeted clinician input.

Teach Learn Med. 2015

[4]
Comparison of Gemini Advanced and ChatGPT 4.0's Performances on the Ophthalmology Resident Ophthalmic Knowledge Assessment Program (OKAP) Examination Review Question Banks.

Cureus. 2024-9-17

[5]
Answering questions in a co-created formative exam question bank improves summative exam performance, while students perceive benefits from answering, authoring, and peer discussion: A mixed methods analysis of PeerWise.

Pharmacol Res Perspect. 2021-8

[6]
Formative student-authored question bank: perceptions, question quality and association with summative performance.

Postgrad Med J. 2017-9-2

[7]
Comparing the performance of artificial intelligence learning models to medical students in solving histology and embryology multiple choice questions.

Ann Anat. 2024-6

[8]
Twelve tips to leverage AI for efficient and effective medical question generation: A guide for educators using Chat GPT.

Med Teach. 2024-8

[9]
Leveraging Large Language Models (LLM) for the Plastic Surgery Resident Training: Do They Have a Role?

Indian J Plast Surg. 2023-8-28

Cited By

[1]
DeepGut: A collaborative multimodal large language model framework for digestive disease assisted diagnosis and treatment.

World J Gastroenterol. 2025-8-21

[2]
Assessing LLM-generated vs. expert-created clinical anatomy MCQs: a student perception-based comparative study in medical education.

Med Educ Online. 2025-12

[3]
AI-generated multiple-choice questions in health science education: Stakeholder perspectives and implementation considerations.

Curr Res Physiol. 2025-8-1

[4]
Empowering tomorrow's public health researchers and clinicians to develop digital health interventions using chatbots, virtual reality, and other AI technologies.

Front Public Health. 2025-7-8

[5]
Large Language Models in Medicine: Applications, Challenges, and Future Directions.

Int J Med Sci. 2025-5-31

[6]
Artificial Intelligence Use in Medical Education: Best Practices and Future Directions.

Curr Urol Rep. 2025-5-29

[7]
Situating governance and regulatory concerns for generative artificial intelligence and large language models in medical education.

NPJ Digit Med. 2025-5-27

[8]
Delving into the Practical Applications and Pitfalls of Large Language Models in Medical Education: Narrative Review.

Adv Med Educ Pract. 2025-4-18

[9]
Can a large language model create acceptable dental board-style examination questions? A cross-sectional prospective study.

J Dent Sci. 2025-4

[10]
Large Language Models in Biochemistry Education: Comparative Evaluation of Performance.

JMIR Med Educ. 2025-4-10
