Shi Runhan, Liu Steven, Xu Xinwei, Ye Zhengqiang, Yang Jin, Le Qihua, Qiu Jini, Tian Lijia, Wei Anji, Shan Kun, Zhao Chen, Sun Xinghuai, Zhou Xingtao, Hong Jiaxu
Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, 200031, China.
NHC Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, 200031, China.
Heliyon. 2024 Jul 14;10(14):e34391. doi: 10.1016/j.heliyon.2024.e34391. eCollection 2024 Jul 30.
To evaluate the performance of four large language models (LLMs), namely GPT-4, PaLM 2, Qwen, and Baichuan 2, in generating responses to inquiries from Chinese patients about dry eye disease (DED).
A two-phase study comprising a cross-sectional test in the first phase and a real-world clinical assessment in the second phase.
Eight board-certified ophthalmologists and 46 patients with DED.
The chatbots' responses to Chinese patients' inquiries about DED were evaluated in two phases. In the first phase, six senior ophthalmologists subjectively rated the chatbots' responses on a 5-point Likert scale across five domains: correctness, completeness, readability, helpfulness, and safety. Objective readability was analyzed using a Chinese readability analysis platform. In the second phase, 46 representative patients with DED posed questions to the two language models that performed best in the first phase (GPT-4 and Baichuan 2) and rated the answers for satisfaction and readability. Two senior ophthalmologists then assessed these responses across the same five domains.
Subjective scores for the five domains and objective readability scores in the first phase; patient satisfaction, readability scores, and subjective scores for the five domains in the second phase.
In the first phase, GPT-4 exhibited superior performance across all five domains (correctness: 4.47; completeness: 4.39; readability: 4.47; helpfulness: 4.49; safety: 4.47; P < 0.05). However, the objective readability analysis revealed that GPT-4's responses were highly complex, with an average score of 12.86 (P < 0.05), compared with 10.87, 11.53, and 11.26 for Qwen, Baichuan 2, and PaLM 2, respectively. In the second phase, the scores for the five domains showed that both GPT-4 and Baichuan 2 were adept at answering questions posed by patients with DED. However, the completeness of Baichuan 2's responses was relatively poor (4.04 vs. 4.48 for GPT-4, P < 0.05). Nevertheless, Baichuan 2's recommendations were more comprehensible than those of GPT-4 (patient readability: 3.91 vs. 4.61, P < 0.05; ophthalmologist readability: 2.67 vs. 4.33).
These findings underscore the potential of LLMs, particularly GPT-4 and Baichuan 2, to deliver accurate and comprehensive responses to questions from Chinese patients about DED.