Comparative performance analysis of large language models: ChatGPT-3.5, ChatGPT-4 and Google Gemini in glucocorticoid-induced osteoporosis.

Affiliations

Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, 300070, China.

Department of Orthopedics, Tianjin Medical University Baodi Hospital, Tianjin, 301800, China.

Publication information

J Orthop Surg Res. 2024 Sep 18;19(1):574. doi: 10.1186/s13018-024-04996-2.

DOI:10.1186/s13018-024-04996-2
PMID:39289734
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11409482/
Abstract

BACKGROUND

The use of large language models (LLMs) in medicine can help physicians improve the quality and effectiveness of health care by increasing the efficiency of medical information management, patient care, medical research, and clinical decision-making.

METHODS

We collected 34 frequently asked questions about glucocorticoid-induced osteoporosis (GIOP), covering the disease's clinical manifestations, pathogenesis, diagnosis, treatment, prevention, and risk factors. We also generated 25 questions based on the 2022 American College of Rheumatology Guideline for the Prevention and Treatment of Glucocorticoid-Induced Osteoporosis (2022 ACR-GIOP Guideline). Each question was posed to each LLM (ChatGPT-3.5, ChatGPT-4, and Google Gemini), and three senior orthopedic surgeons independently rated every response on a scale of 1 to 4 points. A total score (TS) > 9 indicated a 'good' response, 6 ≤ TS ≤ 9 a 'moderate' response, and TS < 6 a 'poor' response.
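
The scoring rule above can be illustrated with a minimal sketch. It assumes, as the abstract implies but does not state outright, that the total score is the plain sum of the three surgeons' integer ratings (so TS ranges from 3 to 12). The function name `classify_response` is hypothetical and is not taken from the paper.

```python
def classify_response(ratings):
    """Sum three 1-4 ratings into a total score (TS) and label it.

    Assumption: TS is the unweighted sum of the three raters' scores.
    Thresholds follow the abstract: TS > 9 good, 6 <= TS <= 9 moderate,
    TS < 6 poor.
    """
    if len(ratings) != 3 or any(r not in (1, 2, 3, 4) for r in ratings):
        raise ValueError("expected three integer ratings between 1 and 4")
    total = sum(ratings)
    if total > 9:
        label = "good"
    elif total >= 6:
        label = "moderate"
    else:
        label = "poor"
    return total, label


# Example: ratings of 4, 3 and 3 sum to 10, which falls in the 'good' band.
print(classify_response([4, 3, 3]))
```

Under this reading, a response needs at least one rating of 4 to count as 'good', since three ratings of 3 sum to exactly 9 and remain 'moderate'.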

RESULTS

In response to the general questions related to GIOP and the 2022 ACR-GIOP Guideline, Google Gemini provided more concise answers than the other LLMs. In terms of pathogenesis, ChatGPT-4 had significantly higher total scores (TSs) than ChatGPT-3.5. For questions related to the 2022 ACR-GIOP Guideline, the TSs of ChatGPT-4 were significantly higher than those of Google Gemini. ChatGPT-3.5 and ChatGPT-4 had significantly higher TSs after self-correction than before it, while Google Gemini's TSs after self-correction did not differ significantly from those before.

CONCLUSIONS

Our study showed that Google Gemini provides more concise and intuitive responses than ChatGPT-3.5 and ChatGPT-4. ChatGPT-4 performed significantly better than ChatGPT-3.5 and Google Gemini in answering general questions about GIOP and questions on the 2022 ACR-GIOP Guideline. ChatGPT-3.5 and ChatGPT-4 self-corrected better than Google Gemini.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2fd6/11409482/326706da6f55/13018_2024_4996_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2fd6/11409482/6ebdea5ead0f/13018_2024_4996_Fig1_HTML.jpg
