Zhao Weilong, Lai Honghao, Pan Bei, Huang Jiajie, Xia Danni, Bai Chunyang, Liu Jiayi, Liu Jianing, Jin Yinghui, Shang Hongcai, Liu Jianping, Shi Nannan, Liu Jie, Chen Yaolong, Estill Janne, Ge Long
Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China.
Evidence-Based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China.
Front Pharmacol. 2025 Jul 25;16:1649041. doi: 10.3389/fphar.2025.1649041. eCollection 2025.
OBJECTIVE: Whether large language models (LLMs) can effectively facilitate the acquisition of Chinese medicine (CM) knowledge remains uncertain. This study aimed to assess the adherence of LLMs to clinical practice guidelines (CPGs) in CM.

METHODS: This cross-sectional study randomly selected ten CM CPGs and constructed 150 questions across three categories: medication based on differential diagnosis (MDD), specific prescription consultation (SPC), and CM theory analysis (CTA). Eight LLMs (GPT-4o, Claude-3.5 Sonnet, Moonshot-v1, ChatGLM-4, DeepSeek-v3, DeepSeek-r1, Claude-4 Sonnet, and Claude-4 Sonnet Thinking) were evaluated using both English and Chinese queries. The main evaluation metrics were accuracy, readability, and the use of safety disclaimers.

RESULTS: Overall, DeepSeek-v3 and DeepSeek-r1 demonstrated superior performance in both English (median 5.00, interquartile range (IQR) 4.00-5.00 vs. median 5.00, IQR 3.70-5.00) and Chinese (both median 5.00, IQR 4.30-5.00), significantly outperforming all other models. All models achieved significantly higher accuracy in Chinese than in English responses (all p < 0.05). Accuracy varied significantly across question categories, with MDD and SPC questions proving more challenging than CTA questions. English responses were less readable (mean Flesch Reading Ease score 32.7) than Chinese responses. Moonshot-v1 provided safety disclaimers at the highest rate (98.7% in English, 100% in Chinese).

CONCLUSION: LLMs showed varying degrees of potential for supporting CM knowledge acquisition. The performance of DeepSeek-v3 and DeepSeek-r1 was satisfactory. Optimizing LLMs into effective tools for disseminating CM information is an important direction for future development.
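For context, the English readability metric cited above is the standard Flesch Reading Ease formula, FRE = 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words). The minimal Python sketch below implements it with a naive vowel-group syllable heuristic; published analyses typically rely on an established library such as textstat, so treat this purely as an illustration of the formula, not the authors' actual pipeline.

```python
import re

def count_syllables(word: str) -> int:
    """Crude syllable estimate: count runs of consecutive vowels (incl. y)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# A score near 32.7 (the reported mean for English responses) indicates
# difficult, college-level prose; higher scores mean easier text.
print(round(flesch_reading_ease("The cat sat on the mat. It was happy."), 1))
```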
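The abstract reports medians with IQRs and p-values for the English-Chinese accuracy comparison but does not name the statistical test used. The sketch below, with entirely hypothetical per-question scores, shows one plausible way to produce such summaries: numpy percentiles for the median and IQR, and a paired Wilcoxon signed-rank test, which would be a natural choice given that the same questions were posed in both languages.

```python
import numpy as np
from scipy import stats

# Hypothetical 1-5 accuracy ratings for one model, paired by question
# across the two query languages (not the study's actual data).
english = np.array([5, 4, 3, 5, 4, 2, 5, 3, 4, 5, 3, 4])
chinese = np.array([5, 5, 4, 5, 5, 4, 5, 5, 4, 5, 5, 5])

for label, scores in (("English", english), ("Chinese", chinese)):
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    print(f"{label}: median {med:.2f}, IQR {q1:.2f}-{q3:.2f}")

# Paired comparison: the same questions answered in both languages.
stat, p = stats.wilcoxon(english, chinese)
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p:.4f}")
```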