


Thyroid Eye Disease and Artificial Intelligence: A Comparative Study of ChatGPT-3.5, ChatGPT-4o, and Gemini in Patient Information Delivery.

Authors

Bahir Daniel, Hartstein Morris, Zloto Ofira, Burkat Cat, Uddin Jimmy, Hamed Azzam Shirin

Affiliations

Ophthalmology Department, Tzafon Medical Center, Azrieli Faculty of Medicine, Bar Ilan University, Israel.

Department of Ophthalmology, Shamir Medical Center, Tzrifin, Israel.

Publication Information

Ophthalmic Plast Reconstr Surg. 2024 Dec 24. doi: 10.1097/IOP.0000000000002882.

DOI: 10.1097/IOP.0000000000002882
PMID: 39718146
Abstract

PURPOSE

This study aimed to compare the effectiveness of three artificial intelligence language models (GPT-3.5, GPT-4o, and Gemini) in delivering patient-centered information about thyroid eye disease (TED). We evaluated their performance based on the accuracy and comprehensiveness of their responses to common patient inquiries regarding TED. The study did not assess the repeatability of artificial intelligence responses, focusing instead on a single-session evaluation per model.

METHODS

Five experienced oculoplastic surgeons assessed the responses generated by the artificial intelligence models to 12 key questions frequently asked by TED patients. These questions addressed TED pathophysiology, risk factors, clinical presentation, diagnostic testing, and treatment options. Each response was rated for correctness and reliability on a 7-point Likert scale, where 1 indicated incorrect or unreliable information and 7 indicated highly accurate and reliable information. Correctness referred to factual accuracy, while reliability assessed trustworthiness for patient use. The evaluations were anonymized, and the final scores were averaged across the surgeons to facilitate model comparisons.
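The scoring step above can be sketched in a few lines. This is a hypothetical illustration with made-up ratings, not the study's raw data: each anonymized rater assigns a 1-to-7 Likert score, and the per-model score is the mean across raters.

```python
# Hypothetical 1-7 Likert ratings from five anonymized raters (made-up numbers,
# not the study's data); the per-model score is the mean across raters.

def mean_rating(ratings):
    """Average a list of 1-7 Likert ratings, rejecting out-of-range values."""
    if any(not 1 <= r <= 7 for r in ratings):
        raise ValueError("Likert ratings must be between 1 and 7")
    return sum(ratings) / len(ratings)

# One question, three models, five raters each:
correctness = {
    "GPT-3.5": [6, 6, 5, 6, 6],
    "GPT-4o":  [5, 6, 5, 5, 5],
    "Gemini":  [5, 5, 4, 5, 5],
}
averaged = {model: mean_rating(r) for model, r in correctness.items()}
```

In the study these per-question means were further averaged over the 12 questions to yield the model-level correctness and reliability scores reported below.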

RESULTS

GPT-3.5 emerged as the top performer, achieving an average correctness score of 5.75 and a reliability score of 5.68, excelling in delivering detailed information on complex topics such as TED treatment and surgical interventions. GPT-4o followed with scores of 5.32 for correctness and 5.25 for reliability, generally providing accurate but less detailed information. Gemini trailed with scores of 5.10 for correctness and 4.70 for reliability, often providing sufficient responses for simpler questions but lacking detail in complex areas like second-line immunosuppressive treatments. Statistical analysis using the Friedman test showed significant differences between models (p < 0.05) for key topics, with GPT-3.5 consistently leading.
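The between-model comparison reported above uses the Friedman test, which ranks the models' scores within each question (block) and compares the rank sums. A minimal pure-Python sketch of the statistic, on made-up scores rather than the study's data (in practice `scipy.stats.friedmanchisquare` computes the same statistic and also returns a p-value):

```python
# Friedman chi-square statistic for repeated measures: rows are blocks
# (e.g. the 12 questions), columns are treatments (e.g. the 3 models).

def friedman_statistic(scores):
    """Rank each row (average ranks for ties), then compare rank sums.

    Note: omits the tie-correction factor for brevity.
    """
    n = len(scores)        # number of blocks (questions)
    k = len(scores[0])     # number of treatments (models)
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1                      # extend the tie group
            avg = (i + j) / 2 + 1           # 1-based average rank for the ties
            for m in range(i, j + 1):
                ranks[order[m]] = avg
            i = j + 1
        for col in range(k):
            rank_sums[col] += ranks[col]
    return (12.0 * sum(r * r for r in rank_sums) / (n * k * (k + 1))
            - 3.0 * n * (k + 1))

# 12 hypothetical question rows, columns: GPT-3.5, GPT-4o, Gemini
chi2 = friedman_statistic([[6, 5, 5], [7, 5, 4], [5, 5, 4], [6, 6, 5]] * 3)
```

If one model is ranked best on every question, the statistic is maximal; identical scores across models give 0, and the statistic is compared against a chi-square distribution with k-1 degrees of freedom to obtain the p-value.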

CONCLUSIONS

GPT-3.5 was the most effective model for delivering reliable and comprehensive patient information, particularly for complex treatment and surgical topics. GPT-4o provided reliable general information but lacked the necessary depth for specialized topics, while Gemini was suitable for addressing basic patient inquiries but insufficient for detailed medical information. This study highlights the role of artificial intelligence in patient education, suggesting that models like GPT-3.5 can be valuable tools for clinicians in enhancing patient understanding of TED.


Similar Articles

1. Thyroid Eye Disease and Artificial Intelligence: A Comparative Study of ChatGPT-3.5, ChatGPT-4o, and Gemini in Patient Information Delivery.
   Ophthalmic Plast Reconstr Surg. 2024 Dec 24. doi: 10.1097/IOP.0000000000002882.
2. Assessment of Recommendations Provided to Athletes Regarding Sleep Education by GPT-4o and Google Gemini: Comparative Evaluation Study.
   JMIR Form Res. 2025 Jul 8;9:e71358. doi: 10.2196/71358.
3. Performance of 3 Conversational Generative Artificial Intelligence Models for Computing Maximum Safe Doses of Local Anesthetics: Comparative Analysis.
   JMIR AI. 2025 May 13;4:e66796. doi: 10.2196/66796.
4. Comparison of ChatGPT and Internet Research for Clinical Research and Decision-Making in Occupational Medicine: Randomized Controlled Trial.
   JMIR Form Res. 2025 May 20;9:e63857. doi: 10.2196/63857.
5. Comparative analysis of LLMs performance in medical embryology: A cross-platform study of ChatGPT, Claude, Gemini, and Copilot.
   Anat Sci Educ. 2025 May 11. doi: 10.1002/ase.70044.
6. Using Artificial Intelligence ChatGPT to Access Medical Information about Chemical Eye Injuries: A Comparative Study.
   JMIR Form Res. 2025 Jun 30. doi: 10.2196/73642.
7. Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3.
   Eye (Lond). 2025 Apr;39(6):1132-1137. doi: 10.1038/s41433-024-03545-9. Epub 2024 Dec 17.
8. Artificial Intelligence in Peripheral Artery Disease Education: A Battle Between ChatGPT and Google Gemini.
   Cureus. 2025 Jun 1;17(6):e85174. doi: 10.7759/cureus.85174. eCollection 2025 Jun.
9. Enhancing Magnetic Resonance Imaging (MRI) Report Comprehension in Spinal Trauma: Readability Analysis of AI-Generated Explanations for Thoracolumbar Fractures.
   JMIR AI. 2025 Jul 1;4:e69654. doi: 10.2196/69654.
10. Assessment of readability, reliability, and quality of large language models in addressing frequently asked questions regarding prenatal screening for fetal chromosomal anomalies.
    Int J Gynaecol Obstet. 2025 Jul 1. doi: 10.1002/ijgo.70348.

Cited By

1. A structured evaluation of LLM-generated step-by-step instructions in cadaveric brachial plexus dissection.
   BMC Med Educ. 2025 Jul 1;25(1):903. doi: 10.1186/s12909-025-07493-0.
2. Chinese generative AI models (DeepSeek and Qwen) rival ChatGPT-4 in ophthalmology queries with excellent performance in Arabic and English.
   Narra J. 2025 Apr;5(1):e2371. doi: 10.52225/narra.v5i1.2371. Epub 2025 Apr 8.