Köksaldı Seher, Kayabaşı Mustafa, Durmaz Engin Ceren, Grzybowski Andrzej
Department of Ophthalmology, Agri Ibrahim Cecen University, 04200, Agri, Turkey.
Department of Ophthalmology, Mus State Hospital, 49200, Mus, Turkey.
Aesthetic Plast Surg. 2025 Jul 21. doi: 10.1007/s00266-025-05071-9.
This study aimed to evaluate the performance of four large language models (LLMs), namely ChatGPT, Gemini, Copilot, and Claude, in responding to questions related to upper eyelid blepharoplasty, focusing on medical accuracy, clinical relevance, response length, and readability.
A set of queries regarding upper eyelid blepharoplasty, covering six categories (anatomy, surgical procedure, additional intraoperative procedures, postoperative monitoring, follow-up, and postoperative complications), was posed to each LLM. An identical prompt establishing clinical context was provided before each question. Responses were evaluated by three ophthalmologists using a 5-point Likert scale for medical accuracy and a 3-point Likert scale for clinical relevance. Response length was also assessed. Readability was evaluated using the Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Coleman-Liau Index, Gunning Fog Index, and Simple Measure of Gobbledygook (SMOG) grade.
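For reference, the five readability indices named above follow standard published formulas; a minimal summary is sketched below under the conventional definitions (the study itself does not restate them, and implementations differ in how syllables and complex words are counted). Here W = total words, S = total sentences, Sy = total syllables, P = polysyllabic (three-or-more-syllable) words, L = average letters per 100 words, and Sw = average sentences per 100 words.

```latex
% Conventional definitions of the five readability indices
% (standard published forms, not quoted from the study;
% syllable- and complex-word-counting conventions vary by implementation).
\begin{align*}
\text{Flesch Reading Ease}        &= 206.835 - 1.015\,\tfrac{W}{S} - 84.6\,\tfrac{Sy}{W} \\
\text{Flesch-Kincaid Grade Level} &= 0.39\,\tfrac{W}{S} + 11.8\,\tfrac{Sy}{W} - 15.59 \\
\text{Coleman-Liau Index}         &= 0.0588\,L - 0.296\,Sw - 15.8 \\
\text{Gunning Fog Index}          &= 0.4\left(\tfrac{W}{S} + 100\,\tfrac{P}{W}\right) \\
\text{SMOG grade}                 &= 1.0430\,\sqrt{P \cdot \tfrac{30}{S}} + 3.1291
\end{align*}
```

Lower Flesch Reading Ease scores and higher grade-level scores both indicate more difficult text, which is why the same set of responses can score as demanding on all five measures at once.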
A total of 30 standardized questions were presented to each LLM. No response from any LLM received a medical accuracy score of 1. ChatGPT achieved the highest rate of 'highly accurate' responses (80%), followed by Claude (60%), Gemini (40%), and Copilot (20%). None of the responses from ChatGPT or Claude received a clinical relevance score of 1, whereas 10% of Gemini's responses and 26.7% of Copilot's responses did. ChatGPT also provided the highest proportion of clinically relevant responses (86.7%), outperforming the other LLMs. Copilot generated the shortest responses and ChatGPT the longest. Readability analyses showed that all responses required advanced reading skills at a 'college graduate' level or higher, with Copilot's responses being the most complex.
Among the evaluated LLMs, ChatGPT demonstrated superior performance in both medical accuracy and clinical relevance for upper eyelid blepharoplasty questions, particularly excelling in the postoperative monitoring and follow-up categories. Although all models generated complex texts requiring advanced literacy, ChatGPT's detailed responses offer valuable guidance for ophthalmologists managing upper eyelid blepharoplasty cases.
Level of Evidence V: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors at www.springer.com/00266.