Department of General Medicine, Juntendo University Faculty of Medicine, Tokyo, Japan.
Department of Community-Oriented Medical Education, Chiba University Graduate School of Medicine, Chiba, Japan.
JMIR Med Educ. 2024 Aug 13;10:e59133. doi: 10.2196/59133.
Evaluating the accuracy and educational utility of artificial intelligence-generated medical cases, especially those produced by large language models such as ChatGPT-4 (developed by OpenAI), is crucial yet underexplored.
This study aimed to assess the educational utility of ChatGPT-4-generated clinical vignettes and their applicability in educational settings.
Using a convergent mixed methods design, a web-based survey was conducted from January 8 to 28, 2024, to evaluate 18 medical cases generated by ChatGPT-4 in Japanese. The survey used 6 main question items to assess the quality and educational utility of the generated clinical vignettes: information quality, information accuracy, educational usefulness, clinical match, terminology accuracy (TA), and diagnosis difficulty. Feedback was solicited from physicians specializing in general internal medicine or general medicine with experience in medical education. Chi-square and Mann-Whitney U tests were performed to identify differences among cases, and linear regression was used to examine trends associated with physicians' years of experience. Thematic analysis of the qualitative feedback was performed to identify areas for improvement and to confirm the educational utility of the cases.
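As a minimal sketch of the statistical workflow described above, the Python code below shows how such tests could be run with scipy. The file name ratings.csv and all column names are hypothetical stand-ins chosen for illustration; this is not the authors' analysis code.

```python
# Illustrative sketch, assuming a hypothetical long-format dataset with one
# row per (respondent, case) pair; column names are invented for this example.
import pandas as pd
from scipy import stats

df = pd.read_csv("ratings.csv")  # columns: case_id, years_experience,
                                 # info_quality (0/1), edu_usefulness (1-5)

# Chi-square test: does the binary information-quality rating differ by case?
table = pd.crosstab(df["case_id"], df["info_quality"])
chi2, p_chi, dof, _ = stats.chi2_contingency(table)

# Mann-Whitney U test: compare Likert ratings between two example cases.
a = df.loc[df["case_id"] == 1, "edu_usefulness"]
b = df.loc[df["case_id"] == 2, "edu_usefulness"]
u_stat, p_mw = stats.mannwhitneyu(a, b, alternative="two-sided")

# Linear regression: trend of ratings against physicians' experience.
slope, intercept, r, p_lr, se = stats.linregress(
    df["years_experience"], df["edu_usefulness"]
)

# Bonferroni correction across all pairwise case comparisons (18 cases).
n_cases = df["case_id"].nunique()
n_tests = n_cases * (n_cases - 1) // 2
alpha_adj = 0.05 / n_tests
print(f"chi2 p={p_chi:.4f}, MWU p={p_mw:.4f}, "
      f"regression p={p_lr:.4f}, adjusted alpha={alpha_adj:.5f}")
```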
Of the 73 invited participants, 71 (97%) responded. The respondents, primarily male (64/71, 90%), spanned a broad range of years in practice (from 1976 to 2017) and represented hospitals of diverse sizes throughout Japan. On binary ratings, the majority deemed information quality (mean 0.77, 95% CI 0.75-0.79) and information accuracy (mean 0.68, 95% CI 0.65-0.71) satisfactory. On a 5-point Likert scale, mean scores were 3.55 (95% CI 3.49-3.60) for educational usefulness, 3.70 (95% CI 3.65-3.75) for clinical match, 3.49 (95% CI 3.44-3.55) for TA, and 2.34 (95% CI 2.28-2.40) for diagnosis difficulty. Statistical analysis showed significant variability in content quality and relevance across the cases (P<.001 after Bonferroni correction). Participants suggested improvements in generating physical examination findings, using natural language, and enhancing medical TA. The thematic analysis highlighted the need for clearer documentation, consistency of clinical information, content relevance, and patient-centered case presentations.
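To illustrate how means with 95% CIs like those reported above can be computed, the sketch below uses a normal-approximation interval on simulated stand-in data with roughly the study's dimensions (71 respondents, 18 cases); it does not use the study data.

```python
# Minimal sketch: mean and 95% CI via normal approximation on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
binary = rng.binomial(1, 0.77, size=71 * 18)  # e.g., information quality (0/1)
likert = rng.integers(1, 6, size=71 * 18)     # e.g., usefulness on a 1-5 scale

def mean_ci(x, conf=0.95):
    """Mean with a normal-approximation confidence interval."""
    m = np.mean(x)
    half = stats.norm.ppf(0.5 + conf / 2) * stats.sem(x)
    return m, m - half, m + half

for name, x in [("binary proportion", binary), ("Likert mean", likert)]:
    m, lo, hi = mean_ci(x)
    print(f"{name}: mean {m:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```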
ChatGPT-4-generated medical cases written in Japanese hold considerable potential as resources in medical education, with quality and accuracy recognized as adequate. Nevertheless, the precision and realism of case details need notable improvement. This study underscores ChatGPT-4's value as an adjunctive educational tool in the medical field, one that requires expert oversight for optimal application.