Evaluating the role of large language models in traditional Chinese medicine diagnosis and treatment recommendations.

Author information

Liu Yu, Yuan Yishan, Yan Keming, Li Yuanyuan, Sacca Valeria, Hodges Sierra, Cannistra Mattia, Jeong Pauline, Wu Jiani, Kong Jian

Affiliations

Department of Psychiatry, Massachusetts General Hospital and Harvard Medical School, Charlestown, MA, USA.

Beijing University of Chinese Medicine, Beijing, China.

Publication information

NPJ Digit Med. 2025 Jul 21;8(1):466. doi: 10.1038/s41746-025-01845-2.

DOI:10.1038/s41746-025-01845-2

PMID:40691277

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12279949/

Abstract

Digital health technologies hold significant potential for reducing global healthcare disparities. Large language models (LLMs) offer new opportunities to enhance access to culturally specific healthcare, including traditional Chinese medicine (TCM). This study evaluated the diagnostic and treatment performance of seven publicly available LLMs using a real-world acupuncture case, comparing their outputs with those of three professional acupuncturists across five domains: Western diagnosis, TCM diagnosis, acupoint selection, needling technique, and herbal medicine. Twenty-eight expert evaluators from China, South Korea, and the United States assessed the responses using a multilingual survey. LLMs performed comparably to acupuncturists in Western diagnosis and showed variable performance in TCM-specific tasks. GPT-4o, Qwen 2.5 Max, and Doubao 1.5 Pro demonstrated the highest alignment with expert evaluations, particularly in TCM diagnosis and acupoint selection. These findings highlight the potential of general-purpose LLMs to support culturally grounded medical decision-making and reduce access barriers in TCM care systems.
