Generative AI vs. human expertise: a comparative analysis of case-based rational pharmacotherapy question generation.

Author Information

Güvel Muhammed Cihan, Kıyak Yavuz Selim, Varan Hacer Doğan, Sezenöz Burak, Coşkun Özlem, Uluoğlu Canan

Affiliations

Department of Medical Pharmacology, Gazi University Faculty of Medicine, Ankara, Turkey.

Department of Medical Education and Informatics, Gazi University Faculty of Medicine, Ankara, Turkey.

Publication Information

Eur J Clin Pharmacol. 2025 Jun;81(6):875-883. doi: 10.1007/s00228-025-03838-2. Epub 2025 Apr 9.

DOI: 10.1007/s00228-025-03838-2
PMID: 40205076
Abstract

PURPOSE

This study evaluated the performance of three generative AI models (ChatGPT-4o, Gemini 1.5 Advanced Pro, and Claude 3.5 Sonnet) in producing case-based rational pharmacology questions compared to expert educators.

METHODS

Using one-shot prompting, 60 questions (20 per model) addressing essential hypertension and type 2 diabetes were generated. A multidisciplinary panel categorized the questions by usability (no revisions needed, minor or major revisions required, or unusable). Subsequently, 24 AI-generated and 8 expert-created questions were administered to 103 medical students in a real-world exam setting. Performance metrics, including correct response rate, discrimination index, and identification of nonfunctional distractors, were analyzed.
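For context on the generation step, the sketch below illustrates the general shape of a one-shot prompt for a case-based pharmacotherapy item: a single worked example followed by a request for a new item on a given topic. It is a minimal illustration only; the exact prompt wording, example item, system message, and model settings are not reported in the abstract, and the `openai` chat-completion client shown here merely stands in for whichever interfaces were used for the three models.

```python
# Minimal sketch of one-shot prompting for case-based MCQ generation.
# The example item and prompt wording are illustrative assumptions, not the
# authors' actual prompt; any chat-completion client could be substituted.
from openai import OpenAI

ONE_SHOT_EXAMPLE = """\
Case: A 58-year-old man is newly diagnosed with essential hypertension ...
Question: Which drug is the most rational first-line choice for this patient?
A) ... B) ... C) ... D) ... E) ...
Answer: C
"""

def build_messages(topic: str) -> list[dict]:
    """Assemble a one-shot prompt: one worked example, then the new request."""
    return [
        {"role": "system",
         "content": "You are a clinical pharmacology educator writing exam items."},
        {"role": "user",
         "content": (
             "Here is one example of a case-based rational pharmacotherapy "
             f"multiple-choice question:\n\n{ONE_SHOT_EXAMPLE}\n"
             f"Now write one new question of the same style on: {topic}."
         )},
    ]

if __name__ == "__main__":
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    resp = client.chat.completions.create(
        model="gpt-4o",  # ChatGPT-4o was one of the three models compared
        messages=build_messages("type 2 diabetes mellitus"),
    )
    print(resp.choices[0].message.content)
```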

RESULTS

No statistically significant differences were found between AI-generated and expert-created questions, with mean correct response rates surpassing 50% and discrimination indices consistently equal to or above 0.20. Claude produced the highest proportion of error-free items (12/20), whereas ChatGPT exhibited the fewest unusable items (5/20). Expert revisions required approximately one minute per AI-generated question, representing a substantial efficiency gain over manual question preparation. Nonetheless, 19 out of 60 AI-generated questions were deemed unusable, highlighting the necessity of expert oversight.
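For readers who want to reproduce item statistics of this kind from raw exam responses, the sketch below shows one common way to compute the correct response rate, a discrimination index, and nonfunctional distractors. It assumes the classical upper-lower (27%) discrimination formula and the usual "chosen by fewer than 5% of examinees" definition of a nonfunctional distractor; the abstract does not state which exact formulas or cutoffs the authors applied, so these are illustrative conventions rather than the study's method.

```python
# Illustrative item analysis. The 27% upper/lower split and the <5% distractor
# cutoff are common conventions assumed here, not taken from the paper.
from collections import Counter

def item_analysis(responses, key, total_scores, options="ABCDE"):
    """responses: option chosen by each examinee for one item;
    total_scores: each examinee's overall exam score (same order);
    key: the correct option letter."""
    n = len(responses)
    correct_rate = sum(r == key for r in responses) / n

    # Discrimination index (upper-lower method): proportion correct among the
    # top 27% of examinees by total score minus that among the bottom 27%.
    k = max(1, round(0.27 * n))
    order = sorted(range(n), key=lambda i: total_scores[i])
    low_group, high_group = order[:k], order[-k:]
    prop = lambda group: sum(responses[i] == key for i in group) / k
    discrimination = prop(high_group) - prop(low_group)

    # A distractor is "nonfunctional" if chosen by fewer than 5% of examinees.
    counts = Counter(responses)
    nonfunctional = [opt for opt in options
                     if opt != key and counts.get(opt, 0) / n < 0.05]
    return correct_rate, discrimination, nonfunctional

if __name__ == "__main__":
    # Made-up demo input to show the call shape; not data from the study,
    # which administered the items to 103 students.
    responses = list("CCACBCCDCC")
    scores = [78, 85, 52, 90, 60, 88, 70, 45, 95, 80]
    print(item_analysis(responses, key="C", total_scores=scores))
```

The values reported in the abstract (mean correct response rate above 50%, discrimination index at or above 0.20) correspond to `correct_rate` and `discrimination` as computed here.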

CONCLUSION

Large language models can profoundly accelerate the development of high-quality assessment questions in medical education. However, expert review remains critical to address lapses in reliability and validity. A hybrid model, integrating AI-driven efficiencies with rigorous expert validation, may offer an optimal approach for enhancing educational outcomes.

Similar Articles

1. Generative AI vs. human expertise: a comparative analysis of case-based rational pharmacotherapy question generation.
Eur J Clin Pharmacol. 2025 Jun;81(6):875-883. doi: 10.1007/s00228-025-03838-2. Epub 2025 Apr 9.
2. AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination.
BMC Med Educ. 2025 Feb 8;25(1):208. doi: 10.1186/s12909-025-06796-6.
3. Quality assurance and validity of AI-generated single best answer questions.
BMC Med Educ. 2025 Feb 25;25(1):300. doi: 10.1186/s12909-025-06881-w.
4. AI in radiography education: Evaluating multiple-choice questions difficulty and discrimination.
J Med Imaging Radiat Sci. 2025 Mar 28;56(4):101896. doi: 10.1016/j.jmir.2025.101896.
5. Can ChatGPT Generate Acceptable Case-Based Multiple-Choice Questions for Medical School Anatomy Exams? A Pilot Study on Item Difficulty and Discrimination.
Clin Anat. 2025 May;38(4):505-510. doi: 10.1002/ca.24271. Epub 2025 Mar 24.
6. ChatGPT for generating multiple-choice questions: Evidence on the use of artificial intelligence in automatic item generation for a rational pharmacotherapy exam.
Eur J Clin Pharmacol. 2024 May;80(5):729-735. doi: 10.1007/s00228-024-03649-x. Epub 2024 Feb 14.
7. Large Language Models in Biochemistry Education: Comparative Evaluation of Performance.
JMIR Med Educ. 2025 Apr 10;11:e67244. doi: 10.2196/67244.
8. AI-generated questions for urological competency assessment: a prospective educational study.
BMC Med Educ. 2025 Apr 25;25(1):611. doi: 10.1186/s12909-025-07202-x.
9. Gemini AI vs. ChatGPT: A comprehensive examination alongside ophthalmology residents in medical knowledge.
Graefes Arch Clin Exp Ophthalmol. 2025 Feb;263(2):527-536. doi: 10.1007/s00417-024-06625-4. Epub 2024 Sep 15.
10. Claude, ChatGPT, Copilot, and Gemini performance versus students in different topics of neuroscience.
Adv Physiol Educ. 2025 Jun 1;49(2):430-437. doi: 10.1152/advan.00093.2024. Epub 2025 Jan 17.
