Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3.

Author information

Zhao Fang-Fang, He Han-Jie, Liang Jia-Jian, Cen Jingyun, Wang Yun, Lin Hongjie, Chen Feifei, Li Tai-Ping, Yang Jian-Feng, Chen Lan, Cen Ling-Ping

Affiliations

Joint Shantou International Eye Center of Shantou University and The Chinese University of Hong Kong, Shantou, Guangdong, China.

Shantou University Medical College, Shantou, Guangdong, China.

Publication information

Eye (Lond). 2025 Apr;39(6):1132-1137. doi: 10.1038/s41433-024-03545-9. Epub 2024 Dec 17.

Abstract

BACKGROUND/OBJECTIVE

This study aimed to evaluate the accuracy, comprehensiveness, and readability of responses generated by various Large Language Models (LLMs) (ChatGPT-3.5, Gemini, Claude 3, and GPT-4.0) in the clinical context of uveitis, utilizing a meticulous grading methodology.

METHODS

Twenty-seven clinical uveitis questions were presented individually to four LLMs: ChatGPT (versions GPT-3.5 and GPT-4.0), Google Gemini, and Claude 3. Three experienced uveitis specialists independently assessed the responses for accuracy using a three-point scale across three rounds with a 48-hour wash-out interval. The final accuracy rating for each LLM response ('Excellent', 'Marginal', or 'Deficient') was determined through a majority consensus approach. Comprehensiveness was evaluated using a three-point scale for responses rated 'Excellent' in the final accuracy assessment. Readability was determined using the Flesch-Kincaid Grade Level formula. Statistical analyses were conducted to discern significant differences among LLMs, employing a significance threshold of p < 0.05.
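As an illustration of two steps named above, the following is a minimal Python sketch, not the authors' actual pipeline. It assumes that "majority consensus" means a strict majority among the graders' labels (the abstract does not say how ties were handled) and that syllables are counted with a rough vowel-group heuristic. The function names `majority_rating`, `count_syllables`, and `flesch_kincaid_grade` are hypothetical. The grade-level formula itself is the standard Flesch-Kincaid one: 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59.

```python
import re
from collections import Counter


def majority_rating(ratings):
    """Return the label chosen by a strict majority of graders.

    Assumption: with no strict majority, return None. The abstract does
    not describe how such cases were resolved.
    """
    label, count = Counter(ratings).most_common(1)[0]
    return label if count > len(ratings) / 2 else None


def count_syllables(word):
    """Rough heuristic (an assumption): count vowel groups and drop a
    trailing silent 'e'. Dedicated readability tools use finer rules."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level of an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


if __name__ == "__main__":
    print(majority_rating(["Excellent", "Excellent", "Marginal"]))
    print(flesch_kincaid_grade("Uveitis is inflammation of the uveal tract."))
```

A lower grade level means easier text, which is the sense in which one model's responses can be more readable than another's while being less accurate.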

RESULTS

Claude 3 and ChatGPT 4 demonstrated significantly higher accuracy compared to Gemini (p < 0.001). Claude 3 also showed the highest proportion of 'Excellent' ratings (96.3%), followed by ChatGPT 4 (88.9%). ChatGPT 3.5, Claude 3, and ChatGPT 4 had no responses rated as 'Deficient', unlike Gemini (14.8%) (p = 0.014). ChatGPT 4 exhibited greater comprehensiveness compared to Gemini (p = 0.008), and Claude 3 showed higher comprehensiveness compared to Gemini (p = 0.042). Gemini showed significantly better readability compared to ChatGPT 3.5, Claude 3, and ChatGPT 4 (p < 0.001). Gemini also had fewer words, letter characters, and sentences compared to ChatGPT 3.5 and Claude 3.
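The reported percentages can be sanity-checked against the 27 questions. The sketch below is a reader's check rather than data from the paper: it assumes every percentage uses all 27 questions as its denominator and is rounded to one decimal place, and the helper name `matching_counts` is hypothetical.

```python
N_QUESTIONS = 27  # number of uveitis questions stated in the Methods

# Percentages as reported in the Results.
reported = {
    "Claude 3, Excellent": 96.3,
    "ChatGPT 4, Excellent": 88.9,
    "Gemini, Deficient": 14.8,
}


def matching_counts(percent, n=N_QUESTIONS):
    """Return every count k in 0..n whose share of n rounds to `percent`."""
    return [k for k in range(n + 1) if round(100 * k / n, 1) == percent]


if __name__ == "__main__":
    for label, percent in reported.items():
        print(label, percent, matching_counts(percent))
```

Under those assumptions, the figures are consistent with 26, 24, and 4 responses out of 27, respectively.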

CONCLUSIONS

Our study highlights the outstanding performance of Claude 3 and ChatGPT 4 in providing precise and thorough information regarding uveitis, surpassing Gemini. ChatGPT 4 and Claude 3 emerge as pivotal tools in improving patient understanding and involvement in their uveitis healthcare journey.
