Ji Hongwei, Wang Xiaofei, Sia Ching-Hui, Yap Jonathan, Lim Soo Teik, Djohan Andie Hartanto, Chang Yaowei, Zhang Ning, Guo Mengqi, Li Fuhai, Lim Zhi Wei, Wang Ya Xing, Sheng Bin, Wong Tien Yin, Cheng Susan, Yeo Khung Keong, Tham Yih-Chung
Beijing Visual Science and Translational Eye Research Institute (BERI), Eye Center of Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua Medicine, Tsinghua University, Beijing, China.
Department of Cardiology, The Affiliated Hospital of Qingdao University, Shandong, China.
Commun Med (Lond). 2025 May 16;5(1):177. doi: 10.1038/s43856-025-00802-0.
Large language models (LLMs) offer promise in addressing layperson queries related to cardiovascular disease (CVD) prevention. However, the accuracy and consistency of the information provided by current general-purpose LLMs remain unclear.
We evaluated the capabilities of BARD (Google's bidirectional language model for semantic understanding), ChatGPT-3.5 and ChatGPT-4.0 (OpenAI's conversational models for generating human-like text), and ERNIE (Baidu's knowledge-enhanced language model for context understanding) in addressing CVD prevention queries in English and Chinese. Seventy-five CVD prevention questions were posed to each LLM. The primary outcome was the accuracy of responses, rated as appropriate, borderline, or inappropriate.
For English prompts, the chatbots' "appropriate" ratings were as follows: BARD at 88.0%, ChatGPT-3.5 at 92.0%, and ChatGPT-4.0 at 97.3%. All models demonstrated temporal improvement in initially suboptimal responses: BARD and ChatGPT-3.5 each improved 67% of theirs (6/9 and 4/6, respectively), while ChatGPT-4.0 achieved a 100% (2/2) improvement rate. Both BARD and ChatGPT-4.0 outperformed ChatGPT-3.5 in recognizing the correctness of their own responses. For Chinese prompts, the "appropriate" ratings were: ERNIE at 84.0%, ChatGPT-3.5 at 88.0%, and ChatGPT-4.0 at 85.3%. However, ERNIE outperformed ChatGPT-3.5 and ChatGPT-4.0 in temporal improvement and self-awareness of correctness.
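As a sanity check on the reported figures, the "appropriate" percentages can be mapped back to whole-number counts out of the 75 questions. This is a minimal sketch; the per-model counts are inferred from the percentages, not taken from the paper's data tables.

```python
# Recover the counts (out of n = 75 questions) implied by each reported
# "appropriate" percentage, then round-trip back to a percentage.
n = 75
reported = {
    "BARD (EN)": 88.0,
    "ChatGPT-3.5 (EN)": 92.0,
    "ChatGPT-4.0 (EN)": 97.3,
    "ERNIE (ZH)": 84.0,
    "ChatGPT-3.5 (ZH)": 88.0,
    "ChatGPT-4.0 (ZH)": 85.3,
}
for model, pct in reported.items():
    count = round(pct / 100 * n)                 # nearest whole question count
    recomputed = round(count / n * 100, 1)       # percentage from that count
    print(f"{model}: {count}/{n} -> {recomputed}%")
```

Each reported percentage round-trips exactly (e.g., 97.3% corresponds to 73/75 appropriate responses), so the figures are internally consistent with a 75-question set.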
For CVD prevention queries in English, ChatGPT-4.0 outperformed the other LLMs in generating appropriate responses, temporal improvement, and self-awareness of correctness. All models' performance dropped slightly for Chinese queries, reflecting potential language bias in these LLMs. Given the growing availability and accessibility of LLM chatbots, regular and rigorous evaluations are essential to thoroughly assess the quality and limitations of the medical information they provide across widely spoken languages.