
A clinician-based comparative study of large language models in answering medical questions: the case of asthma.

Author Information

Yin Yong, Zeng Mei, Wang Hansong, Yang Haibo, Zhou Caijing, Jiang Feng, Wu Shufan, Huang Tingyue, Yuan Shuahua, Lin Jilei, Tang Mingyu, Chen Jiande, Dong Bin, Yuan Jiajun, Xie Dan

Affiliations

Department of Respiratory Medicine, Hainan Branch, Shanghai Children's Medical Center, School of Medicine, Shanghai Jiao Tong University, Sanya, China.

Department of Respiratory Medicine, Shanghai Children's Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China.

Publication Information

Front Pediatr. 2025 Apr 25;13:1461026. doi: 10.3389/fped.2025.1461026. eCollection 2025.

Abstract

OBJECTIVE

This study aims to evaluate and compare the performance of four major large language models (GPT-3.5, GPT-4.0, YouChat, and Perplexity) in answering 32 common asthma-related questions.

MATERIALS AND METHODS

Seventy-five clinicians from various tertiary hospitals participated in this study. Each clinician was tasked with evaluating the responses generated by the four large language models (LLMs) to 32 common clinical questions related to pediatric asthma. Based on predefined criteria, participants subjectively assessed the accuracy, correctness, completeness, and practicality of the LLMs' answers. The participants provided precise scores to determine the performance of each language model in answering pediatric asthma-related questions.

RESULTS

GPT-4.0 performed best across all dimensions, while YouChat performed worst. Both GPT-3.5 and GPT-4.0 outperformed the other two models, though the differences between GPT-3.5 and GPT-4.0, and between YouChat and Perplexity, were not statistically significant.

CONCLUSION

GPT and other large language models can answer medical questions with a certain degree of completeness and accuracy. However, clinicians should critically assess information obtained online, distinguishing reliable from unreliable content, and should not accept these models' outputs uncritically. With advancements in key technologies, LLMs may one day become a safe option for doctors seeking information.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ecf/12062090/609f5caea851/fped-13-1461026-g001.jpg
