Cheng Ning, Chen Yue, Gao Wanqing, Liu Jiajun, Huang Qunfu, Yan Cheng, Huang Xindi, Ding Changsong
School of Informatics, Hunan University of Chinese Medicine, Changsha, China.
Big Data Analysis Laboratory of Traditional Chinese Medicine, Hunan University of Chinese Medicine, Changsha, China.
Front Genet. 2021 Dec 22;12:807825. doi: 10.3389/fgene.2021.807825. eCollection 2021.
This study proposes the S-TextBLCNN model for classifying traditional Chinese medicine (TCM) formulae by efficacy. The model uses deep learning to analyze the relationship between herb efficacy and formula efficacy, which helps to further explore the internal rules of formula combination. First, for the TCM herbs extracted from …, natural language processing (NLP) is used to learn a quantitative representation of each herb. Three features (herb name, herb properties, and herb efficacy) are selected to encode herbs and to construct the herb-vector and the formula-vector. Then, based on 2,664 formulae for stroke collected from the TCM literature and 19 formula efficacy categories extracted from …, an improved deep learning model, TextBLCNN, is proposed; it consists of a bidirectional long short-term memory (Bi-LSTM) neural network and a convolutional neural network (CNN). Binary classifiers are established for the 19 formula efficacy categories to classify the TCM formulae. Finally, to address the class imbalance in the formula data, the over-sampling method SMOTE is applied, yielding the S-TextBLCNN model. The formula-vector composed of herb efficacy gives the best classification performance, which suggests a strong relationship between herb efficacy and formula efficacy. The TextBLCNN model achieves an accuracy of 0.858 and an F-score of 0.762, both higher than the logistic regression (acc = 0.561, F-score = 0.567), SVM (acc = 0.703, F-score = 0.591), LSTM (acc = 0.723, F-score = 0.621), and TextCNN (acc = 0.745, F-score = 0.644) models. In addition, applying SMOTE to tackle the data imbalance improves the F-score by an average of 47.1% across the 19 models. The combination of formula feature representation and the S-TextBLCNN model improves the accuracy of formula efficacy classification and provides a new research idea for the study of TCM formula compatibility.
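The over-sampling step named in the abstract can be illustrated with a minimal sketch of the SMOTE idea: new minority-class samples are generated by interpolating between a minority sample and one of its nearest minority-class neighbours. This is not the authors' implementation. The function names `smote_oversample` and `balance_binary`, the parameter `k`, and the toy vectors are assumptions made for illustration, and the sketch assumes each formula has already been encoded as a fixed-length numeric vector with a 0/1 label for one efficacy category.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length feature vectors (lists of floats).
    Simplified illustration of SMOTE; names and defaults are hypothetical.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority samples, nearest to the base first.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance_binary(X, y, k=5, seed=0):
    """Oversample the minority class of one binary task until both classes match in size."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    minority, label = (pos, 1) if len(pos) < len(neg) else (neg, 0)
    n_new = abs(len(pos) - len(neg))
    new_points = smote_oversample(minority, n_new, k=k, seed=seed) if n_new else []
    return X + new_points, y + [label] * len(new_points)


# Toy data: four negative and two positive vectors for one efficacy category.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y = [0, 0, 0, 0, 1, 1]
X_balanced, y_balanced = balance_binary(X, y, k=1)
```

In a setting with one binary classifier per efficacy category, a step like this would be applied separately to each category's training split, and never to the test split, so that the evaluation reflects the original class distribution.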