Li Shiwei, Jiang Jun, Yang Xiaodong
Department of Pediatric Surgery, West China Hospital, Sichuan University, Chengdu, China.
J Child Orthop. 2025 Apr 15:18632521251331772. doi: 10.1177/18632521251331772.
OBJECTIVE: To evaluate the performance of three large language models in answering questions regarding pediatric developmental dysplasia of the hip.

METHODS: We formulated 18 open-ended clinical questions in both Chinese and English and established a gold standard set of answers to benchmark the responses of the large language models. These questions were presented to ChatGPT-4o, Gemini, and Claude 3.5 Sonnet. The responses were evaluated by two independent reviewers using a 5-point scale. The average score, rounded to the nearest whole number, was taken as the final score. A final score of 4 or 5 indicated an accurate response, whereas a final score of 1, 2, or 3 indicated an inaccurate response.

RESULTS: The raters demonstrated a high level of agreement in scoring the answers, with weighted Kappa coefficients of 0.865 for Chinese responses (P < 0.001) and 0.875 for English responses (P < 0.001). No significant differences in accuracy were observed among the three large language models, with rates of 83.3%, 77.8%, and 77.8% for Claude 3.5 Sonnet, ChatGPT-4o, and Gemini in the Chinese responses (P = 1), and 83.3%, 83.3%, and 72.2% for ChatGPT-4o, Claude 3.5 Sonnet, and Gemini in the English responses (P = 0.761). In addition, there was no significant difference in the performance of the same large language model between the Chinese and English settings.

CONCLUSIONS: Large language models demonstrate high accuracy in delivering information on developmental dysplasia of the hip, maintaining consistent performance across both Chinese and English, which suggests their potential utility as medical support tools.

LEVEL OF EVIDENCE: Level II.
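The scoring rule described in the methods can be illustrated with a minimal Python sketch. It is not the authors' code. The function names and the example ratings are hypothetical, and the abstract does not say how an average ending in .5 was rounded, so rounding half up is an assumption here.

```python
from typing import Iterable, Tuple


def final_score(rater_a: int, rater_b: int) -> int:
    """Average two 1-5 ratings and round to the nearest whole number.

    Assumption: ties (x.5) round up; the abstract does not state the tie rule.
    """
    for rating in (rater_a, rater_b):
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
    # Integer form of floor((a + b) / 2 + 0.5), i.e. round half up.
    return (rater_a + rater_b + 1) // 2


def is_accurate(score: int) -> bool:
    """A final score of 4 or 5 counts as accurate; 1, 2, or 3 as inaccurate."""
    return score >= 4


def accuracy_rate(rating_pairs: Iterable[Tuple[int, int]]) -> float:
    """Percentage of questions whose final score is 4 or 5."""
    finals = [final_score(a, b) for a, b in rating_pairs]
    return 100.0 * sum(is_accurate(s) for s in finals) / len(finals)


# Hypothetical ratings for 18 questions (not data from the study).
example = [(5, 4)] * 15 + [(3, 3)] * 3
print(round(accuracy_rate(example), 1))
```

With 18 questions, the reported rates of 83.3%, 77.8%, and 72.2% are consistent with 15, 14, and 13 accurate responses. The tie rule matters near the threshold: ratings of 3 and 4 average 3.5, which counts as accurate only if ties round up.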