Dai Yizheng, Shao Xin, Zhang Jinlu, Chen Yulong, Chen Qian, Liao Jie, Chi Fei, Zhang Junhua, Fan Xiaohui
Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China; State Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing 314103, China.
Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China; State Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing 314103, China; The Joint-laboratory of clinical multi-omics research between Zhejiang University and Ningbo Municipal Hospital of TCM, Ningbo Municipal Hospital of TCM, Ningbo 315000, China.
Pharmacol Res. 2024 Dec;210:107530. doi: 10.1016/j.phrs.2024.107530. Epub 2024 Nov 29.
The use of ground-breaking large language models (LLMs) coupled with dialogue systems has become increasingly prevalent in the medical domain. Nevertheless, despite several TCM LLMs proposed recently, the expertise of LLMs in Traditional Chinese Medicine (TCM) remains limited. Herein, we introduce TCMChat (https://xomics.com.cn/tcmchat), a generative LLM built by pre-training (PT) and supervised fine-tuning (SFT) on large-scale curated TCM text knowledge and Chinese question-answering (QA) datasets. In detail, we first compiled a customized collection covering six scenarios of Chinese medicine as the training set through text mining and manual verification: TCM knowledge base, multiple-choice questions, reading comprehension, entity extraction, medical case diagnosis, and herb or formula recommendation. Next, we subjected the model to PT and SFT, using Baichuan2-7B-Chat as the foundation model. Benchmarking datasets and case studies further demonstrate the superior performance of TCMChat compared with existing models. Our code, data and model are publicly released on GitHub (https://github.com/ZJUFanLab/TCMChat) and HuggingFace (https://huggingface.co/ZJUFanLab), providing a high-quality knowledge base for TCM modernization research together with a user-friendly dialogue web tool.
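To make the dataset-construction step concrete, the sketch below shows one plausible way to wrap QA pairs from the six training scenarios into instruction-tuning records. The JSON schema, field names, and the sample question are illustrative assumptions for this sketch, not the exact format released with TCMChat.

```python
import json

# The six TCM training scenarios listed in the abstract.
SCENARIOS = [
    "knowledge_base",
    "multiple_choice",
    "reading_comprehension",
    "entity_extraction",
    "case_diagnosis",
    "herb_formula_recommendation",
]

def format_sft_record(scenario: str, instruction: str, response: str) -> dict:
    """Wrap one QA pair as an instruction/output record for SFT.

    The record layout follows a common instruction-tuning convention;
    it is an assumption, not the confirmed TCMChat data format.
    """
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario}")
    return {
        "scenario": scenario,
        "instruction": instruction,
        "output": response,
    }

# Hypothetical example record for the recommendation scenario.
record = format_sft_record(
    "herb_formula_recommendation",
    "Recommend a classical formula for a wind-cold common cold.",
    "A classical option is Ma Huang Tang (Ephedra Decoction).",
)
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Keeping a `scenario` tag on every record makes it easy to balance the six task types when mixing them into a single SFT corpus.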