Wang Xionghui, Zheng Tianxiao, Liu Bo, Pei Zhi, Meng Kaihan, Ling Changquan
Department of Gastroenterology, No. 967 Hospital of PLA Joint Logistics Support Force, Dalian, China.
School of Traditional Chinese Medicine, Naval Medical University, No. 800, Xiangyin Road, Yangpu District, Shanghai, 200433, China, 86 02181871561.
JMIR Form Res. 2025 Aug 25;9:e66503. doi: 10.2196/66503.
ChatGPT-4.0 and the ChatGLM series are novel conversational large language models (LLMs). The ChatGLM series includes 3 versions: ChatGLM4 (internet connectivity but no knowledge base pretraining), ChatGLM4+Knowledge base (internet search capabilities combined with knowledge base pretraining), and ChatGLM3-6B (offline knowledge base pretraining but no internet connectivity). The ability of ChatGPT-4.0 and ChatGLM to apply medical knowledge in the Chinese context has been preliminarily verified, but the potential of these 2 model families for clinical assistance in traditional Chinese medicine (TCM) remains unknown.
This study aims to explore the performance of ChatGPT-4.0, ChatGLM4, ChatGLM4+Knowledge base, and ChatGLM3-6B in providing artificial intelligence (AI)-assisted diagnosis and treatment for metabolic dysfunction-associated fatty liver disease within a TCM clinical framework, and thereby to assess their potential as TCM clinical decision support tools.
This study evaluated the 4 LLMs by providing them with the medical records of 87 cases of metabolic dysfunction-associated fatty liver disease treated with TCM and asking them for TCM treatment plans. The responses from the 4 LLMs were scored against predefined criteria covering 3 dimensions: ability in syndrome differentiation and treatment principles, confusion of concepts between TCM and Western medicine, and comprehensive evaluation of question-answering texts (comprising 6 components: ability to integrate Chinese and Western medicine, ability to formulate treatment plans, health management capacity, disease monitoring ability, self-positioning awareness, and medication safety).
In the "Ability in syndrome differentiation and treatment principles" module, the 4 models ranked as follows: (1) ChatGLM4+Knowledge base, (2) ChatGLM4, (3) ChatGLM3-6B, and (4) ChatGPT-4.0. In the assessment of confusion between TCM and Western medicine concepts, ChatGPT-4.0 exhibited conceptual confusion in 32 of 87 cases, ChatGLM3-6B in 1 case, and ChatGLM4 and ChatGLM4+Knowledge base in none. In the "Comprehensive evaluation of question-answering texts" module, the ranking was: (1) ChatGLM4+Knowledge base, (2) ChatGPT-4.0, (3) ChatGLM4, and (4) ChatGLM3-6B.
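As a rough illustration, the concept-confusion counts reported above can be converted into per-model proportions. This is a minimal sketch and not the authors' analysis code; the names `TOTAL_CASES`, `confusion_counts`, and `confusion_rate` are hypothetical, and only the counts (32, 0, 0, and 1 out of 87) come from the abstract.

```python
# Number of evaluated cases, as stated in the abstract.
TOTAL_CASES = 87

# Cases with TCM/Western medicine concept confusion per model (from the abstract).
# The dictionary and function names are illustrative assumptions.
confusion_counts = {
    "ChatGPT-4.0": 32,
    "ChatGLM4": 0,
    "ChatGLM4+Knowledge base": 0,
    "ChatGLM3-6B": 1,
}


def confusion_rate(count, total=TOTAL_CASES):
    """Return the percentage of cases that showed conceptual confusion."""
    return 100.0 * count / total


rates = {model: confusion_rate(n) for model, n in confusion_counts.items()}

for model, rate in rates.items():
    print(f"{model}: {confusion_counts[model]}/{TOTAL_CASES} = {rate:.1f}%")
```

Under these counts, ChatGPT-4.0 shows confusion in roughly 36.8% of cases and ChatGLM3-6B in roughly 1.1%.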
Our results indicated that real-time internet connectivity played a critical role in LLM-assisted TCM diagnosis and treatment, whereas the offline model showed significantly reduced performance in clinical decision support. Furthermore, pretraining LLMs with TCM-specific knowledge bases while maintaining internet search capabilities substantially enhanced their diagnostic and therapeutic performance in TCM applications. Importantly, general-purpose LLMs required both domain-specific medical fine-tuning and culturally sensitive adaptation to meet the rigorous standards of TCM clinical practice.