

Comparison of physician and large language model chatbot responses to online ear, nose, and throat inquiries.

Author Information

Motegi Masaomi, Shino Masato, Kuwabara Mikio, Takahashi Hideyuki, Matsuyama Toshiyuki, Tada Hiroe, Hagiwara Hiroyuki, Chikamatsu Kazuaki

Affiliations

Department of Otolaryngology-Head and Neck Surgery, Gunma University Graduate School of Medicine, 3-39-15 Showamachi, Maebashi, Gunma, 371-8511, Japan.

Department of Otolaryngology, Maebashi Red Cross Hospital, 389-1 Asakuramachi, Maebashi, Gunma, 371-0811, Japan.

Publication Information

Sci Rep. 2025 Jul 1;15(1):21346. doi: 10.1038/s41598-025-06769-1.

Abstract

Large language models (LLMs) can potentially enhance the accessibility and quality of medical information. This study evaluates the reliability and quality of responses generated by ChatGPT-4, an LLM-driven chatbot, compared with those written by physicians, focusing on otorhinolaryngological advice in real-world, text-based workflows. Patient inquiries and physician responses from a public social media forum were anonymized, and ChatGPT-4 generated replies to the same inquiries. A panel of seven board-certified otorhinolaryngologists assessed both sets of responses on six criteria: overall quality, empathy, alignment with medical consensus, information accuracy, inquiry comprehension, and harm potential. Ordinal logistic regression identified factors influencing response quality. ChatGPT-4 responses were preferred in 70.7% of cases and were significantly longer (median 162 words) than physician responses (median 67 words; P < .0001). The chatbot's responses received higher ratings across all criteria; the key predictors of higher quality were greater empathy, stronger alignment with medical consensus, lower potential for harm, and fewer inaccuracies. ChatGPT-4 consistently outperformed physicians in generating responses that adhered to medical consensus, were accurate, and conveyed empathy. These findings suggest that integrating AI tools into text-based healthcare consultations could help physicians address complex, nuanced inquiries and provide high-quality, comprehensive medical advice.
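
For readers unfamiliar with the method, the sketch below illustrates the kind of ordinal logistic regression the abstract describes: an ordered overall-quality rating modeled as a function of the other rating criteria. This is a minimal illustration, not the authors' code, assuming Python with statsmodels; the column names and simulated data are hypothetical stand-ins for the study's rater dataset.

import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 200  # hypothetical number of rated responses

# Hypothetical per-response rater scores on the abstract's criteria (1-5 scales).
df = pd.DataFrame({
    "empathy": rng.integers(1, 6, n),
    "consensus": rng.integers(1, 6, n),
    "accuracy": rng.integers(1, 6, n),
    "harm": rng.integers(1, 6, n),
})

# Simulate an ordinal overall-quality rating loosely driven by the predictors.
latent = (0.8 * df["empathy"] + 0.6 * df["consensus"]
          - 0.5 * df["harm"] + rng.normal(0, 1, n))
df["quality"] = pd.cut(latent, bins=5, labels=False) + 1

# The outcome must be an ordered categorical for OrderedModel.
df["quality"] = pd.Categorical(df["quality"], ordered=True)

# Proportional-odds (ordinal logistic) model: which criteria predict quality?
model = OrderedModel(df["quality"],
                     df[["empathy", "consensus", "accuracy", "harm"]],
                     distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())  # coefficient signs/magnitudes indicate each criterion's association

In this framing, positive coefficients (e.g., on empathy or consensus) and negative ones (e.g., on harm) would correspond to the abstract's reported predictors of higher response quality.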

