A Weighted Voting Approach for Traditional Chinese Medicine Formula Classification Using Large Language Models: Algorithm Development and Validation Study.

Authors

Wang Zhe, Li Keqian, Peng Suyuan, Liu Lihong, Yang Xiaolin, Yao Keyu, Herre Heinrich, Zhu Yan

Affiliations

Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences; School of Basic Medicine, Peking Union Medical College, Beijing, China.

Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig, Leipzig, Germany.

Publication information

JMIR Med Inform. 2025 Jul 24;13:e69286. doi: 10.2196/69286.

Abstract

BACKGROUND

Several clinical cases and experiments have demonstrated the effectiveness of traditional Chinese medicine (TCM) formulas in treating and preventing diseases. These formulas contain critical information about their ingredients, efficacy, and indications. Classifying TCM formulas based on this information can help standardize TCM formula management, support clinical and research applications, and promote the modernization and scientific use of TCM. TCM formulas can be classified using various approaches, including manual classification, machine learning, and deep learning. Additionally, large language models (LLMs) are gaining prominence in the biomedical field. Integrating LLMs into TCM research could significantly enhance and accelerate the discovery of TCM knowledge by leveraging their advanced linguistic understanding and contextual reasoning capabilities.

OBJECTIVE

The objective of this study is to evaluate the performance of different LLMs in the TCM formula classification task. Additionally, by employing ensemble learning with multiple fine-tuned LLMs, this study aims to enhance classification accuracy.

METHODS

The TCM formula data were manually refined and cleaned. We selected 10 LLMs that support Chinese for fine-tuning. We then employed an ensemble learning approach that combined the predictions of multiple models using both hard voting and weighted voting, with each model's weight determined by its average accuracy. Finally, we applied weighted voting to two subsets of models: the top 5 most effective models from each series of LLMs (weighted voting, top 5) and the 3 most accurate of the 10 models (weighted voting, top 3).
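
The two voting schemes described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the tie-breaking rule, the category labels, and the accuracy values are all assumptions made for illustration, and the abstract does not specify how ties are resolved.

```python
from collections import Counter, defaultdict


def hard_vote(predictions):
    """Majority vote: every model counts equally.

    Assumed tie rule: the tied label that appears first in `predictions` wins.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    return next(label for label in predictions if counts[label] == top)


def weighted_vote(predictions, weights):
    """Each model adds its weight (e.g. its average accuracy) to the label it predicts."""
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    # max() returns the first label that reaches the top score, so ties are
    # resolved in favor of the label voted for earliest in the list.
    return max(scores, key=scores.get)


def top_k(predictions, weights, k):
    """Keep only the k highest-weighted models before voting."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    return [predictions[i] for i in order], [weights[i] for i in order]


# Hypothetical example: three models with made-up labels and accuracies.
preds = ["heat-clearing", "tonifying", "tonifying"]
accs = [0.75, 0.35, 0.36]

print(hard_vote(preds))            # label chosen by the most models
print(weighted_vote(preds, accs))  # label with the largest summed weight
```

In this toy case the two schemes disagree: two weak models outvote one strong model under hard voting, while weighted voting follows the single model whose weight exceeds the other two combined. The `top_k` helper mirrors the idea of voting over only the most accurate models.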

RESULTS

A total of 2441 TCM formulas were curated manually from multiple sources, including the Coding Rules for Chinese Medicinal Formulas and Their Codes, the Chinese National Medical Insurance Catalog for proprietary Chinese medicines, textbooks of TCM formulas, and TCM literature. The dataset was divided into a training set of 1999 TCM formulas and a test set of 442 TCM formulas. On the test set, Qwen-14B achieved the highest accuracy among the single models, at 75.32%. The accuracy rates for hard voting, weighted voting, weighted voting (top 5), and weighted voting (top 3) were 75.79%, 76.47%, 75.57%, and 77.15%, respectively.

CONCLUSIONS

This study explored the effectiveness of LLMs in the TCM formula classification task. To this end, we propose an ensemble learning method that integrates multiple fine-tuned LLMs through a voting mechanism. This method not only improves classification accuracy but also enhances the existing system for classifying TCM formulas by efficacy.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/24ab/12292024/2bb88f4b89e2/medinform-v13-e69286-g001.jpg