Luo Jigen, Xiong Wangping, Du Jianqiang, Liu Yingfeng, Li Jianwen, Hu Dingxing
School of Computer, Jiangxi University of Chinese Medicine, Nanchang 330004, Jiangxi, China.
Qihuang Academy, Jiangxi University of Chinese Medicine, Nanchang 330004, Jiangxi, China.
Evid Based Complement Alternat Med. 2021 Nov 29;2021:2337924. doi: 10.1155/2021/2337924. eCollection 2021.
Text similarity calculation is core to commercial artificial intelligence applications such as traditional Chinese medicine (TCM) auxiliary diagnosis, intelligent question answering, and prescription recommendation. However, TCM texts pose several difficulties: short sentences, inaccurate word segmentation, strong semantic relevance, and high-dimensional, sparse features. This study takes the temporal information of sentence context into account and proposes a TCM text similarity calculation model based on the bidirectional temporal Siamese network (BTSN). We used the Enhanced Representation through Knowledge Integration (ERNIE) pretrained language model to train character vectors instead of word vectors, which addresses the problem of inaccurate word segmentation in TCM. In the Siamese network, the traditional fully connected neural network was replaced by a deep bidirectional long short-term memory (BLSTM) network to capture the contextual semantics of the current word. The improved BLSTM maps the sentences to be tested into two sets of low-dimensional numerical vectors, on which similarity calculation training is then performed. Experiments on two datasets, one financial and one TCM, show that the BTSN model outperformed other similarity calculation models. Accuracy was highest when the BLSTM had 6 layers. These results suggest that the proposed text similarity calculation model has high engineering value.
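The architecture described in the abstract can be illustrated with a minimal sketch in pure Python. It is not the authors' implementation. It assumes untrained random weights, a per-character random vector as a hypothetical stand-in for ERNIE character vectors, mean pooling over time steps, and cosine similarity as the comparison function, none of which the abstract specifies. The names `LSTMCell`, `SiameseEncoder`, and `similarity` are illustrative only. What the sketch does show is the Siamese idea: both sentences pass through the same stacked bidirectional LSTM, character by character, with no word segmentation step.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """One LSTM direction with random, untrained weights (biases omitted)."""

    GATES = "ifgo"  # input, forget, candidate, output

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        width = input_size + hidden_size
        self.weights = {
            gate: [[rng.uniform(-0.5, 0.5) for _ in range(width)]
                   for _ in range(hidden_size)]
            for gate in self.GATES
        }

    def run(self, inputs):
        """Return the hidden state at every time step."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        outputs = []
        for x in inputs:
            z = x + h  # concatenation of the current input and previous hidden state
            pre = {gate: [sum(w * v for w, v in zip(row, z)) for row in rows]
                   for gate, rows in self.weights.items()}
            c = [sigmoid(f) * c_old + sigmoid(i) * math.tanh(g)
                 for i, f, g, c_old in zip(pre["i"], pre["f"], pre["g"], c)]
            h = [sigmoid(o) * math.tanh(c_new) for o, c_new in zip(pre["o"], c)]
            outputs.append(h)
        return outputs


class SiameseEncoder:
    """Stacked bidirectional LSTM shared by both branches of the Siamese network."""

    def __init__(self, embed_size=8, hidden_size=8, num_layers=2, seed=0):
        rng = random.Random(seed)
        self.embed_size = embed_size
        self.seed = seed
        self.layers = []
        in_size = embed_size
        for _ in range(num_layers):
            self.layers.append((LSTMCell(in_size, hidden_size, rng),
                                LSTMCell(in_size, hidden_size, rng)))
            in_size = 2 * hidden_size  # next layer sees forward + backward states

    def embed(self, ch):
        # Assumption: a fixed random vector per character stands in for a
        # pretrained character vector; no ERNIE model is loaded here.
        rng = random.Random(ord(ch) + self.seed)
        return [rng.uniform(-1.0, 1.0) for _ in range(self.embed_size)]

    def encode(self, sentence):
        if not sentence:
            raise ValueError("sentence must contain at least one character")
        seq = [self.embed(ch) for ch in sentence]  # character level, no segmentation
        for forward, backward in self.layers:
            f_out = forward.run(seq)
            b_out = backward.run(seq[::-1])[::-1]  # read right to left, then realign
            seq = [f + b for f, b in zip(f_out, b_out)]
        dim = len(seq[0])
        return [sum(step[k] for step in seq) / len(seq) for k in range(dim)]  # mean pooling


def similarity(encoder, sentence_a, sentence_b):
    """Cosine similarity of the two encodings (an assumed comparison function)."""
    u, v = encoder.encode(sentence_a), encoder.encode(sentence_b)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


encoder = SiameseEncoder(num_layers=2)
# Hypothetical example inputs: "headache and fever" vs. "cough with profuse phlegm".
print(similarity(encoder, "头痛发热", "咳嗽痰多"))
```

Because the weights are untrained, the printed score carries no meaning. In a trained model the shared encoder would be optimized on labeled sentence pairs, and `num_layers` is the depth parameter that the abstract reports as performing best at 6.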