Kim Min Seo, Lee Jung Su, Bae Hyuna
College of Medicine, Kangwon National University, Chuncheon, Korea.
Korea Medical Dispute Mediation and Arbitration Agency, Seoul, Korea.
Healthc Inform Res. 2025 Apr;31(2):200-208. doi: 10.4258/hir.2025.31.2.200. Epub 2025 Apr 30.
Assessing medical disputes requires both medical and legal expertise, presenting challenges for patients seeking clarity regarding potential malpractice claims. This study aimed to develop and evaluate a chatbot built on a chain-of-thought pipeline with a large language model (LLM) for medical dispute counseling, and to compare its responses with those of human experts.
Retrospective counseling cases (n = 279) were collected from the Korea Medical Dispute Mediation and Arbitration Agency's website, from which 50 cases were randomly selected as a validation dataset. The Claude 3.5 Sonnet model processed each counseling request through a five-step chain-of-thought pipeline. Thirty-eight experts evaluated the chatbot's responses against the original human expert responses, rating them across four dimensions on a 5-point Likert scale. Statistical analyses were conducted using Wilcoxon signed-rank tests.
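The abstract does not specify the five chain-of-thought steps or any implementation detail; the following is a minimal Python sketch of how such a pipeline and the paired Wilcoxon signed-rank comparison could be assembled. The step prompts, the model identifier string, and helper names such as STEP_PROMPTS and run_pipeline are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a five-step chain-of-thought pipeline
# over the Anthropic API, followed by a Wilcoxon signed-rank comparison of
# paired Likert ratings. Step prompts, the model string, and all names are
# illustrative assumptions.
import anthropic
from scipy.stats import wilcoxon

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-sonnet-20241022"  # assumed Claude 3.5 Sonnet identifier

# Hypothetical five steps; the abstract does not name them.
STEP_PROMPTS = [
    "Summarize the patient's account of the medical dispute.",
    "Identify the clinical issues and the standard of care in question.",
    "Identify the legal issues relevant to a potential malpractice claim.",
    "Reason step by step about how the clinical and legal issues interact.",
    "Draft a counseling response for the patient in plain language.",
]

def run_pipeline(case_text: str) -> str:
    """Feed each step's output into the next; return the final counseling draft."""
    context = case_text
    for prompt in STEP_PROMPTS:
        message = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            messages=[{"role": "user",
                       "content": f"{prompt}\n\n{context}"}],
        )
        context = message.content[0].text  # carry the step output forward
    return context

# Paired expert ratings of chatbot vs. human responses on one dimension
# (toy numbers), compared with the Wilcoxon signed-rank test as in the study.
chatbot_scores = [5, 4, 5, 4, 5, 3, 4, 5]
human_scores   = [4, 4, 3, 4, 4, 3, 3, 4]
stat, p_value = wilcoxon(chatbot_scores, human_scores)
print(f"Wilcoxon W = {stat:.1f}, p = {p_value:.4f}")
```

In this sketch each stage's output becomes the next stage's input, mirroring the chained structure described in the Methods; the final step's output would be what evaluators rate against the original human expert responses.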
The chatbot significantly outperformed human experts in quality of information (p < 0.001), understanding and reasoning (p < 0.001), and overall satisfaction (p < 0.001). It also demonstrated a stronger tendency to produce opinion-driven content (p < 0.001). Despite generally high scores, evaluators noted specific instances where the chatbot encountered difficulties.
A chain-of-thought-based LLM chatbot shows promise for enhancing the quality of medical dispute counseling, outperforming human experts across key evaluation metrics. Future research should address inaccuracies resulting from legal and contextual variability, investigate patient acceptance, and further refine the chatbot's performance in domain-specific applications.