Zhao Fang-Fang, He Han-Jie, Liang Jia-Jian, Cen Jingyun, Wang Yun, Lin Hongjie, Chen Feifei, Li Tai-Ping, Yang Jian-Feng, Chen Lan, Cen Ling-Ping
Joint Shantou International Eye Center of Shantou University and The Chinese University of Hong Kong, Shantou, Guangdong, China.
Shantou University Medical College, Shantou, Guangdong, China.
Eye (Lond). 2025 Apr;39(6):1132-1137. doi: 10.1038/s41433-024-03545-9. Epub 2024 Dec 17.
BACKGROUND/OBJECTIVE: This study aimed to evaluate the accuracy, comprehensiveness, and readability of responses generated by four large language models (LLMs), ChatGPT-3.5, GPT-4.0, Gemini, and Claude 3, in the clinical context of uveitis, using a rigorous grading methodology.
METHODS: Twenty-seven clinical uveitis questions were presented individually to four LLMs: ChatGPT (versions GPT-3.5 and GPT-4.0), Google Gemini, and Claude 3. Three experienced uveitis specialists independently assessed the responses for accuracy using a three-point scale across three rounds, with a 48-hour wash-out interval between rounds. The final accuracy rating for each LLM response ('Excellent', 'Marginal', or 'Deficient') was determined by majority consensus. Comprehensiveness was evaluated on a three-point scale for responses rated 'Excellent' in the final accuracy assessment. Readability was determined using the Flesch-Kincaid Grade Level formula. Statistical analyses were conducted to discern significant differences among LLMs, with a significance threshold of p < 0.05.
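For context on the two quantitative measures described in the methods, both the Flesch-Kincaid Grade Level and the majority-consensus accuracy label can be computed with a few lines of code. The sketch below is illustrative only and is not the authors' scoring pipeline: the syllable counter is a rough vowel-group heuristic, and the example response text and specialist ratings are hypothetical.

```python
from collections import Counter
import re


def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; dedicated readability tools use dictionaries."""
    vowel_groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(vowel_groups))


def flesch_kincaid_grade(text: str) -> float:
    """FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59


def majority_rating(ratings: list[str]) -> str:
    """Final accuracy label taken as the majority vote among the three specialists."""
    label, _ = Counter(ratings).most_common(1)[0]
    return label


# Hypothetical example: one LLM response and three specialist ratings.
response = "Uveitis is inflammation of the uvea. It can affect one or both eyes."
print(round(flesch_kincaid_grade(response), 1))          # readability grade level
print(majority_rating(["Excellent", "Excellent", "Marginal"]))  # -> 'Excellent'
```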
RESULTS: Claude 3 and ChatGPT 4 demonstrated significantly higher accuracy than Gemini (p < 0.001). Claude 3 also had the highest proportion of 'Excellent' ratings (96.3%), followed by ChatGPT 4 (88.9%). ChatGPT 3.5, Claude 3, and ChatGPT 4 had no responses rated 'Deficient', whereas Gemini had 14.8% (p = 0.014). Both ChatGPT 4 (p = 0.008) and Claude 3 (p = 0.042) showed greater comprehensiveness than Gemini. Gemini showed significantly better readability than ChatGPT 3.5, Claude 3, and ChatGPT 4 (p < 0.001), and its responses contained fewer words, letter characters, and sentences than those of ChatGPT 3.5 and Claude 3.
CONCLUSIONS: Our study highlights the strong performance of Claude 3 and ChatGPT 4 in providing accurate and thorough information about uveitis, surpassing Gemini. ChatGPT 4 and Claude 3 may therefore serve as valuable tools for improving patient understanding of, and involvement in, their uveitis care.