Kim Min Seo, Lee Jung Su, Bae Hyuna
College of Medicine, Kangwon National University, Chuncheon, Korea.
Korea Medical Dispute Mediation and Arbitration Agency, Seoul, Korea.
Healthc Inform Res. 2025 Apr;31(2):200-208. doi: 10.4258/hir.2025.31.2.200. Epub 2025 Apr 30.
Assessing medical disputes requires both medical and legal expertise, presenting challenges for patients seeking clarity regarding potential malpractice claims. This study aimed to develop and evaluate a chatbot built on a chain-of-thought pipeline with a large language model (LLM) for medical dispute counseling, and to compare its responses with those of human experts.
Retrospective counseling cases (n = 279) were collected from the Korea Medical Dispute Mediation and Arbitration Agency's website, from which 50 cases were randomly selected as a validation dataset. The Claude 3.5 Sonnet model processed each counseling request through a five-step chain-of-thought pipeline. Thirty-eight experts evaluated the chatbot's responses against the original human expert responses, rating them across four dimensions on a 5-point Likert scale. Statistical analyses were conducted using Wilcoxon signed-rank tests.
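The abstract does not specify the five chain-of-thought steps or any implementation detail; the following is a minimal Python sketch of how such a pipeline and the paired Wilcoxon signed-rank comparison could be assembled. The step prompts, the model identifier string, and helper names such as STEP_PROMPTS and run_pipeline are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a five-step chain-of-thought pipeline
# over the Anthropic API, followed by a Wilcoxon signed-rank comparison of
# paired Likert ratings. Step prompts, the model string, and all names are
# illustrative assumptions.
import anthropic
from scipy.stats import wilcoxon

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-sonnet-20241022"  # assumed Claude 3.5 Sonnet identifier

# Hypothetical five steps; the abstract does not name them.
STEP_PROMPTS = [
    "Summarize the patient's account of the medical dispute.",
    "Identify the clinical issues and the standard of care in question.",
    "Identify the legal issues relevant to a potential malpractice claim.",
    "Reason step by step about how the clinical and legal issues interact.",
    "Draft a counseling response for the patient in plain language.",
]

def run_pipeline(case_text: str) -> str:
    """Feed each step's output into the next; return the final counseling draft."""
    context = case_text
    for prompt in STEP_PROMPTS:
        message = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            messages=[{"role": "user",
                       "content": f"{prompt}\n\n{context}"}],
        )
        context = message.content[0].text  # carry the step output forward
    return context

# Paired expert ratings of chatbot vs. human responses on one dimension
# (toy numbers), compared with the Wilcoxon signed-rank test as in the study.
chatbot_scores = [5, 4, 5, 4, 5, 3, 4, 5]
human_scores   = [4, 4, 3, 4, 4, 3, 3, 4]
stat, p_value = wilcoxon(chatbot_scores, human_scores)
print(f"Wilcoxon W = {stat:.1f}, p = {p_value:.4f}")
```

In this sketch each stage's output becomes the next stage's input, mirroring the chained structure described in the Methods; the final step's output would be what evaluators rate against the original human expert responses.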
The chatbot significantly outperformed human experts in quality of information (p < 0.001), understanding and reasoning (p < 0.001), and overall satisfaction (p < 0.001). It also demonstrated a stronger tendency to produce opinion-driven content (p < 0.001). Despite generally high scores, evaluators noted specific instances where the chatbot encountered difficulties.
A chain-of-thought-based LLM chatbot shows promise for enhancing the quality of medical dispute counseling, outperforming human experts across key evaluation metrics. Future research should address inaccuracies resulting from legal and contextual variability, investigate patient acceptance, and further refine the chatbot's performance in domain-specific applications.