Bahir Daniel, Hartstein Morris, Zloto Ofira, Burkat Cat, Uddin Jimmy, Hamed Azzam Shirin
Ophthalmology Department, Tzafon Medical Center, Azrieli Faculty of Medicine, Bar Ilan University, Israel.
Department of Ophthalmology, Shamir Medical Center, Tzrifin, Israel.
Ophthalmic Plast Reconstr Surg. 2024 Dec 24. doi: 10.1097/IOP.0000000000002882.
This study aimed to compare the effectiveness of three artificial intelligence language models (GPT-3.5, GPT-4o, and Gemini) in delivering patient-centered information about thyroid eye disease (TED). We evaluated their performance based on the accuracy and comprehensiveness of their responses to common patient inquiries regarding TED. The study did not assess the repeatability of artificial intelligence responses, focusing instead on a single-session evaluation per model.
Five experienced oculoplastic surgeons assessed the responses generated by the artificial intelligence models to 12 key questions frequently asked by TED patients. These questions addressed TED pathophysiology, risk factors, clinical presentation, diagnostic testing, and treatment options. Each response was rated for correctness and reliability on a 7-point Likert scale, where 1 indicated incorrect or unreliable information and 7 indicated highly accurate and reliable information. Correctness referred to factual accuracy, while reliability assessed trustworthiness for patient use. The evaluations were anonymized, and the final scores were averaged across the surgeons to facilitate model comparisons.
GPT-3.5 emerged as the top performer, achieving an average correctness score of 5.75 and a reliability score of 5.68, excelling in delivering detailed information on complex topics such as TED treatment and surgical interventions. GPT-4o followed with scores of 5.32 for correctness and 5.25 for reliability, generally providing accurate but less detailed information. Gemini trailed with scores of 5.10 for correctness and 4.70 for reliability, often providing sufficient responses for simpler questions but lacking detail in complex areas like second-line immunosuppressive treatments. Statistical analysis using the Friedman test showed significant differences between models (p < 0.05) for key topics, with GPT-3.5 consistently leading.
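For readers unfamiliar with the Friedman test, the comparison above can be sketched in a few lines of Python. The scores below are hypothetical per-question Likert means (the study's raw data are not reproduced here), and the function computes the uncorrected Friedman chi-square statistic, a simplification that omits the tie-correction factor a full statistical package would apply.

```python
def friedman_statistic(blocks):
    """Uncorrected Friedman chi-square for repeated measures.

    blocks: list of per-question score tuples, one value per model.
    Each question acts as a block; models are ranked within it.
    """
    n = len(blocks)        # number of questions (blocks)
    k = len(blocks[0])     # number of models (treatments)
    rank_sums = [0.0] * k
    for scores in blocks:
        # Rank scores within the question, averaging ranks for ties.
        order = sorted(range(k), key=lambda i: scores[i])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and scores[order[j + 1]] == scores[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # ranks are 1-based
            for m in range(i, j + 1):
                ranks[order[m]] = avg
            i = j + 1
        for m in range(k):
            rank_sums[m] += ranks[m]
    # Friedman chi-square: 12/(n*k*(k+1)) * sum(R_j^2) - 3*n*(k+1)
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) \
        - 3 * n * (k + 1)

# Hypothetical per-question scores for 12 questions,
# ordered (GPT-3.5, GPT-4o, Gemini).
scores = [(6, 5, 5), (6, 5, 4), (5, 5, 5), (6, 6, 5),
          (6, 5, 5), (5, 5, 4), (6, 5, 5), (6, 6, 4),
          (5, 5, 5), (6, 5, 5), (6, 5, 4), (6, 6, 5)]
print(round(friedman_statistic(scores), 2))  # → 10.67
```

With k = 3 models, the statistic is compared against a chi-square distribution with 2 degrees of freedom, where values above about 5.99 correspond to p < 0.05.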
GPT-3.5 was the most effective model for delivering reliable and comprehensive patient information, particularly for complex treatment and surgical topics. GPT-4o provided reliable general information but lacked the necessary depth for specialized topics, while Gemini was suitable for addressing basic patient inquiries but insufficient for detailed medical information. This study highlights the role of artificial intelligence in patient education, suggesting that models like GPT-3.5 can be valuable tools for clinicians in enhancing patient understanding of TED.