Preliminary assessment of large language models' performance in answering questions on developmental dysplasia of the hip.

Authors

Li Shiwei, Jiang Jun, Yang Xiaodong

Affiliation

Department of Pediatric Surgery, West China Hospital, Sichuan University, Chengdu, China.

Publication

J Child Orthop. 2025 Apr 15:18632521251331772. doi: 10.1177/18632521251331772.


DOI: 10.1177/18632521251331772
PMID: 40248439
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11999979/
Abstract

OBJECTIVE: To evaluate the performance of three large language models in answering questions regarding pediatric developmental dysplasia of the hip.

METHODS: We formulated 18 open-ended clinical questions in both Chinese and English and established a gold standard set of answers to benchmark the responses of the large language models. These questions were presented to ChatGPT-4o, Gemini, and Claude 3.5 Sonnet. The responses were evaluated by two independent reviewers using a 5-point scale. The average score, rounded to the nearest whole number, was taken as the final score. A final score of 4 or 5 indicated an accurate response, whereas a final score of 1, 2, or 3 indicated an inaccurate response.

RESULTS: The raters demonstrated a high level of agreement in scoring the answers, with weighted Kappa coefficients of 0.865 for Chinese responses (P < 0.001) and 0.875 for English responses (P < 0.001). No significant differences in accuracy were observed among the three large language models: accuracy rates were 83.3%, 77.8%, and 77.8% for Claude 3.5 Sonnet, ChatGPT-4o, and Gemini in the Chinese responses (P = 1), and 83.3%, 83.3%, and 72.2% for ChatGPT-4o, Claude 3.5 Sonnet, and Gemini in the English responses (P = 0.761). In addition, there was no significant difference in the performance of the same large language model between the Chinese and English settings.

CONCLUSIONS: Large language models demonstrate high accuracy in delivering information on dysplasia of the hip, maintaining consistent performance across both Chinese and English, which suggests their potential utility as medical support tools.

LEVEL OF EVIDENCE: Level II.
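The scoring rule described in the methods can be illustrated with a short Python sketch. It is not the authors' code. The function names and the rating data are hypothetical, and because the abstract does not say how a tie such as 3.5 was rounded, the sketch assumes ties round up. With 18 questions, the reported rates of 83.3%, 77.8%, and 72.2% correspond arithmetically to 15, 14, and 13 accurate answers.

```python
import math


def final_score(score_a: int, score_b: int) -> int:
    """Average two reviewers' 1-5 scores and round to the nearest integer.

    Assumption: ties (x.5) round up; the abstract does not state a tie rule.
    """
    return math.floor((score_a + score_b) / 2 + 0.5)


def is_accurate(score: int) -> bool:
    """A final score of 4 or 5 counts as accurate; 1, 2, or 3 does not."""
    return score >= 4


def accuracy_rate(rating_pairs) -> float:
    """Fraction of questions whose final score is classed as accurate."""
    finals = [final_score(a, b) for a, b in rating_pairs]
    return sum(is_accurate(s) for s in finals) / len(finals)


# Hypothetical reviewer ratings for 18 questions (not data from the study).
ratings = (
    [(5, 5)] * 10   # both reviewers give 5
    + [(4, 5)] * 3  # average 4.5
    + [(4, 3)] * 2  # average 3.5, rounds up to 4 under the tie assumption
    + [(3, 3)] * 2  # inaccurate
    + [(2, 3)]      # average 2.5, rounds up to 3, still inaccurate
)

print(f"{accuracy_rate(ratings):.1%}")
```

Under a different tie rule (for example, Python's built-in `round`, which rounds halves to the nearest even integer), the two `(4, 3)` pairs would still yield 4, but a pair such as `(4, 5)` would yield 4 instead of 5, so the tie convention can matter near the accuracy threshold only for averages of 3.5.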



