Jiang Zehua, Xu Yueyuan, Lim Zhi Wei, Wang Ziyao, Han Yingxiang, Yew Samantha Min Er, Pan Zhe, Wang Qian, Wu Gangyue, Wong Tien Yin, Wang Xiaofei, Wang Yaxing, Tham Yih Chung
Beijing Visual Science and Translational Eye Research Institute (BERI), Beijing Tsinghua Changgung Hospital Eye Center, School of Clinical Medicine, Tsinghua Medicine, Tsinghua University, Beijing, China.
Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore.
Eye (Lond). 2025 Apr 13. doi: 10.1038/s41433-025-03775-5.
The performance of global large language models (LLMs), trained largely on Western data, in addressing disease-related queries in other settings and languages is unknown. Using myopia as an illustrative case, we evaluated global versus Chinese-domain LLMs on Chinese-specific myopia-related questions.
Global LLMs (ChatGPT-3.5, ChatGPT-4.0, Google Bard, Llama-2 7B Chat) and Chinese-domain LLMs (HuatuoGPT, MedGPT, Ali Tongyi Qianwen, Baidu ERNIE Bot, and Baidu ERNIE 4.0) were included. All LLMs were prompted to address 39 Chinese-specific myopia queries spanning 10 domains. Three myopia experts rated the accuracy of each response on a 3-point scale. Responses rated "Good" were further evaluated for comprehensiveness and empathy on a 5-point scale; responses rated "Poor" were prompted for self-correction and then re-analysed.
The top three LLMs in accuracy were ChatGPT-3.5 (8.72 ± 0.75), Baidu ERNIE 4.0 (8.62 ± 0.62), and ChatGPT-4.0 (8.59 ± 0.93), each achieving the highest proportion of "Good" responses at 94.8%. The top five LLMs for comprehensiveness were ChatGPT-3.5 (4.58 ± 0.42), ChatGPT-4.0 (4.56 ± 0.50), Baidu ERNIE 4.0 (4.44 ± 0.49), MedGPT (4.34 ± 0.59), and Baidu ERNIE Bot (4.22 ± 0.74) (all p ≥ 0.059 versus ChatGPT-3.5). For empathy, the top five were ChatGPT-3.5 (4.75 ± 0.25), ChatGPT-4.0 (4.68 ± 0.32), MedGPT (4.50 ± 0.47), Baidu ERNIE Bot (4.42 ± 0.46), and Baidu ERNIE 4.0 (4.34 ± 0.64) (all p ≥ 0.052 versus ChatGPT-3.5). Baidu ERNIE 4.0 received no "Poor" ratings; the other LLMs demonstrated self-correction capabilities, with improvements ranging from 50% to 100%.
Both global and Chinese-domain LLMs performed effectively in addressing Chinese-specific myopia-related queries. Notably, global LLMs achieved the best performance in Chinese-language settings despite being trained primarily on non-Chinese data and in English.