
A clinician-based comparative study of large language models in answering medical questions: the case of asthma.

Author Information

Yin Yong, Zeng Mei, Wang Hansong, Yang Haibo, Zhou Caijing, Jiang Feng, Wu Shufan, Huang Tingyue, Yuan Shuahua, Lin Jilei, Tang Mingyu, Chen Jiande, Dong Bin, Yuan Jiajun, Xie Dan

Affiliations

Department of Respiratory Medicine, Hainan Branch, Shanghai Children's Medical Center, School of Medicine, Shanghai Jiao Tong University, Sanya, China.

Department of Respiratory Medicine, Shanghai Children's Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China.

Publication Information

Front Pediatr. 2025 Apr 25;13:1461026. doi: 10.3389/fped.2025.1461026. eCollection 2025.

Abstract

OBJECTIVE

This study aims to evaluate and compare the performance of four major large language models (GPT-3.5, GPT-4.0, YouChat, and Perplexity) in answering 32 common asthma-related questions.

MATERIALS AND METHODS

Seventy-five clinicians from various tertiary hospitals participated in this study. Each clinician was tasked with evaluating the responses generated by the four large language models (LLMs) to 32 common clinical questions related to pediatric asthma. Based on predefined criteria, participants subjectively assessed the accuracy, correctness, completeness, and practicality of the LLMs' answers. The participants provided precise scores to determine the performance of each language model in answering pediatric asthma-related questions.

RESULTS

GPT-4.0 performed best across all dimensions, while YouChat performed worst. Both GPT-3.5 and GPT-4.0 outperformed the other two models, though the differences between GPT-3.5 and GPT-4.0, and between YouChat and Perplexity, were not statistically significant.

CONCLUSION

GPT and other large language models can answer medical questions with a certain degree of completeness and accuracy. However, clinicians should critically assess information obtained online, distinguishing reliable from unreliable content, and should not accept these models' outputs uncritically. With advancements in key technologies, LLMs may one day become a safe option for doctors seeking information.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ecf/12062090/609f5caea851/fped-13-1461026-g001.jpg
