Yin Yong, Zeng Mei, Wang Hansong, Yang Haibo, Zhou Caijing, Jiang Feng, Wu Shufan, Huang Tingyue, Yuan Shuahua, Lin Jilei, Tang Mingyu, Chen Jiande, Dong Bin, Yuan Jiajun, Xie Dan
Department of Respiratory Medicine, Hainan Branch, Shanghai Children's Medical Center, School of Medicine, Shanghai Jiao Tong University, Sanya, China.
Department of Respiratory Medicine, Shanghai Children's Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
Front Pediatr. 2025 Apr 25;13:1461026. doi: 10.3389/fped.2025.1461026. eCollection 2025.
This study aims to evaluate and compare the performance of four major large language models (GPT-3.5, GPT-4.0, YouChat, and Perplexity) in answering 32 common pediatric asthma-related questions.
Seventy-five clinicians from various tertiary hospitals participated in this study. Each clinician evaluated the responses generated by the four large language models (LLMs) to 32 common clinical questions related to pediatric asthma. Using predefined criteria, participants subjectively rated the accuracy, correctness, completeness, and practicality of the LLMs' answers and assigned numerical scores to quantify each model's performance on pediatric asthma-related questions.
GPT-4.0 performed best across all dimensions, while YouChat performed worst. Both GPT-3.5 and GPT-4.0 outperformed the other two models, although the differences between GPT-3.5 and GPT-4.0, and between YouChat and Perplexity, were not statistically significant.
GPT and other large language models can answer medical questions with a certain degree of completeness and accuracy. However, clinicians should critically appraise information obtained from the internet, distinguishing reliable from unreliable data, and should not blindly accept these models' outputs. With advances in key technologies, LLMs may one day become a safe option for doctors seeking information.