Comparative performance analysis of large language models: ChatGPT-3.5, ChatGPT-4 and Google Gemini in glucocorticoid-induced osteoporosis.

Author information

Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, 300070, China.

Department of Orthopedics, Tianjin Medical University Baodi Hospital, Tianjin, 301800, China.

Publication information

J Orthop Surg Res. 2024 Sep 18;19(1):574. doi: 10.1186/s13018-024-04996-2.

Abstract

BACKGROUND

The use of large language models (LLMs) in medicine can help physicians improve the quality and effectiveness of health care by increasing the efficiency of medical information management, patient care, medical research, and clinical decision-making.

METHODS

We collected 34 frequently asked questions about glucocorticoid-induced osteoporosis (GIOP), covering the disease's clinical manifestations, pathogenesis, diagnosis, treatment, prevention, and risk factors. We also generated 25 questions based on the 2022 American College of Rheumatology Guideline for the Prevention and Treatment of Glucocorticoid-Induced Osteoporosis (2022 ACR-GIOP Guideline). Each question was posed to each LLM (ChatGPT-3.5, ChatGPT-4, and Google Gemini), and three senior orthopedic surgeons independently rated each response on a scale of 1 to 4 points. A total score (TS) > 9 indicated a 'good' response, 6 ≤ TS ≤ 9 a 'moderate' response, and TS < 6 a 'poor' response.
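The grading thresholds above can be illustrated with a minimal sketch. It assumes that the TS is the sum of the three raters' scores and that each score is an integer from 1 to 4, so the TS ranges from 3 to 12; the abstract does not state either point explicitly. The function names `total_score` and `classify` are hypothetical and are not taken from the study.

```python
from typing import Sequence


def total_score(ratings: Sequence[int]) -> int:
    """Sum three raters' scores into a total score (TS).

    Assumption: TS is the plain sum of three integer ratings, each 1 to 4.
    """
    if len(ratings) != 3:
        raise ValueError("expected exactly three ratings")
    if any(r not in (1, 2, 3, 4) for r in ratings):
        raise ValueError("each rating must be an integer from 1 to 4")
    return sum(ratings)


def classify(ts: int) -> str:
    """Map a TS to the category thresholds given in the Methods."""
    if ts > 9:
        return "good"
    if ts >= 6:
        return "moderate"
    return "poor"


if __name__ == "__main__":
    # Hypothetical ratings for illustration only, not data from the study.
    for ratings in ([4, 3, 3], [3, 3, 3], [2, 2, 1]):
        ts = total_score(ratings)
        print(ratings, ts, classify(ts))
```

Under these assumptions, a response counts as 'good' only when the three ratings sum to at least 10, which requires at least one rater to award the maximum of 4 points.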

RESULTS

In response to the general questions about GIOP and the questions based on the 2022 ACR-GIOP Guideline, Google Gemini provided more concise answers than the other LLMs. For questions on pathogenesis, ChatGPT-4 had significantly higher total scores (TSs) than ChatGPT-3.5. For questions based on the 2022 ACR-GIOP Guideline, the TSs of ChatGPT-4 were significantly higher than those of Google Gemini. After self-correction, the TSs of ChatGPT-3.5 and ChatGPT-4 were significantly higher than before, whereas the TSs of Google Gemini's self-corrected responses did not differ significantly from those of its original responses.

CONCLUSIONS

Our study showed that Google Gemini provides more concise and intuitive responses than ChatGPT-3.5 and ChatGPT-4. ChatGPT-4 performed significantly better than ChatGPT-3.5 and Google Gemini in answering general questions about GIOP and questions based on the 2022 ACR-GIOP Guideline. ChatGPT-3.5 and ChatGPT-4 self-corrected better than Google Gemini.
