Nielsen Jacob P S, Mikkelsen August Krogh, Kuenzel Julian, Sebelik Merry E, Madani Gitta, Yang Tsung-Lin, Todsen Tobias
Department of Otorhinolaryngology, Head and Neck Surgery and Audiology, Copenhagen University Hospital (Rigshospitalet), 2100 Copenhagen, Denmark.
Department of Clinical Medicine, University of Copenhagen, 2100 Copenhagen, Denmark.
Diagnostics (Basel). 2025 Jul 22;15(15):1848. doi: 10.3390/diagnostics15151848.
Background: Otolaryngologists are increasingly using head and neck ultrasound (HNUS). Determining whether a practitioner of HNUS has achieved adequate theoretical knowledge remains a challenge. This study assesses the performance of two large language models (LLMs) in generating multiple-choice questions (MCQs) for head and neck ultrasound, compared with MCQs generated by physicians. Methods: Physicians and two LLMs (ChatGPT, GPT-4o, and Google Gemini, Gemini Advanced) created a total of 90 MCQs covering the topics of lymph nodes, the thyroid, and the salivary glands. Experts in HNUS additionally evaluated all physician-drafted MCQs using a Delphi-like process. The MCQs were assessed by an international panel of experts in HNUS who were blinded to the source of the questions. Using a Likert scale, the evaluation was based on an overall assessment and six criteria: clarity, relevance, suitability, quality of distractors, adequate rationale for the answer, and level of difficulty. Results: Four experts in the clinical field of HNUS assessed the 90 MCQs. No significant differences were observed between the two LLMs. Physician-drafted questions (n = 30) differed significantly from Google Gemini in terms of relevance, suitability, and adequate rationale for the answer, but differed significantly from ChatGPT only in terms of suitability. Compared with MCQ items (n = 16) validated by medical experts, LLM-constructed MCQ items scored significantly lower across all criteria. The difficulty level of the MCQs was the same. Conclusions: Our study demonstrates that both LLMs can be used to generate MCQ items with a quality comparable to drafts from physicians. However, the quality of LLM-generated MCQ items was still significantly lower than that of MCQs validated by ultrasound experts. LLMs are therefore a cost-effective way to generate quick drafts of MCQ items, which should then be validated by experts before being used for assessment purposes. In this way, the value of LLMs lies not in eliminating human input, but in substantially reducing the time required.