Niu Yirou, Fu Shuojin, Xuan Zehui, Kang Ruifu, Ren Zhifang, Jin Shuai, Wang Yanling, Xiao Qian
School of Nursing, Capital Medical University, Beijing, China.
Digit Health. 2025 Jul 10;11:20552076251349616. doi: 10.1177/20552076251349616. eCollection 2025 Jan-Dec.
To investigate the performance (accuracy, comprehensiveness, consistency, and the necessary information ratio) of large language models (LLMs) in providing knowledge related to respiratory aspiration, and to explore the potential of using LLMs as training tools.
This was an evaluative study that did not involve human subjects. Two LLMs (GPT-3.5 and GPT-4) were asked 36 questions about respiratory aspiration (32 objective questions and four subjective questions) in English and Chinese. Two experts scored the responses against gold standards derived from authoritative books. The accuracy of the two LLMs' responses to objective questions was compared using the chi-square test or Fisher's exact test. For subjective questions, the t-test or the Mann-Whitney U test was used to compare the two LLMs.
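As an illustration of the accuracy comparison described above, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table of correct and incorrect answers. It is not the authors' analysis code: the function name `fisher_exact_two_sided` and the counts in the usage lines are hypothetical, chosen only to match the 32 objective questions per model.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are the two models, columns are correct / incorrect counts.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of every table that is no more likely
    # than the observed one (small tolerance for float rounding).
    total = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


# Hypothetical counts, NOT taken from the study:
# model A answers 28 of 32 correctly, model B answers 30 of 32 correctly.
p_value = fisher_exact_two_sided(28, 4, 30, 2)
print(f"two-sided p = {p_value:.3f}")
```

Fisher's exact test is typically preferred over the chi-square test when expected cell counts are small, which is likely when nearly all answers are correct.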
There was no significant difference between the ratings given by the two experts. Both LLMs achieved high accuracy on the objective questions. They also performed well on the subjective questions, showing high levels of accuracy, comprehensiveness, consistency, and necessary information ratio. No significant differences were found between the two LLMs in the accuracy of the English and Chinese responses to subjective questions (z = 0.331, P = 0.886; z = 1.703, P = 0.114), or in the comprehensiveness of the English and Chinese responses (t = 0.787, P = 0.461; t = 1.175, P = 0.285).
LLMs performed well in delivering respiratory aspiration-related knowledge and showed promise as supportive tools in training, particularly when their limitations are well understood.