Nielsen Jacob P S, Mikkelsen August Krogh, Kuenzel Julian, Sebelik Merry E, Madani Gitta, Yang Tsung-Lin, Todsen Tobias
Department of Otorhinolaryngology, Head and Neck Surgery and Audiology, Copenhagen University Hospital (Rigshospitalet), 2100 Copenhagen, Denmark.
Department of Clinical Medicine, University of Copenhagen, 2100 Copenhagen, Denmark.
Diagnostics (Basel). 2025 Jul 22;15(15):1848. doi: 10.3390/diagnostics15151848.
Background: Otolaryngologists are increasingly using head and neck ultrasound (HNUS). Determining whether a practitioner of HNUS has achieved adequate theoretical knowledge remains a challenge. This study assesses the performance of two large language models (LLMs) in generating multiple-choice questions (MCQs) for head and neck ultrasound, compared with MCQs generated by physicians. Methods: Physicians and two LLMs (ChatGPT, GPT-4o, and Google Gemini, Gemini Advanced) created a total of 90 MCQs covering the topics of lymph nodes, the thyroid, and the salivary glands. Experts in HNUS additionally evaluated all physician-drafted MCQs using a Delphi-like process. The MCQs were assessed by an international panel of experts in HNUS who were blinded to the source of the questions. Using a Likert scale, the evaluation was based on an overall assessment and six criteria: clarity, relevance, suitability, quality of distractors, adequate rationale for the answer, and level of difficulty. Results: Four experts in the clinical field of HNUS assessed the 90 MCQs. No significant differences were observed between the two LLMs. Physician-drafted questions (n = 30) differed significantly from Google Gemini in terms of relevance, suitability, and adequate rationale for the answer, but differed significantly from ChatGPT only in terms of suitability. Compared with MCQ items (n = 16) validated by medical experts, LLM-constructed MCQ items scored significantly lower across all criteria. The difficulty level of the MCQs was the same. Conclusions: Our study demonstrates that both LLMs can be used to generate MCQ items with a quality comparable to drafts from physicians. However, the quality of LLM-generated MCQ items was still significantly lower than that of MCQs validated by ultrasound experts. LLMs are therefore a cost-effective way to generate quick drafts of MCQ items, which should then be validated by experts before being used for assessment purposes. In this way, the value of LLMs lies not in eliminating human input, but in substantially reducing the time required.