Hou Jingming, Saad Saidah, Omar Nazlia
Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia.
PeerJ Comput Sci. 2024 May 31;10:e2022. doi: 10.7717/peerj-cs.2022. eCollection 2024.
Our study focuses on Traditional Chinese Medicine (TCM) named entity recognition (NER), which involves identifying and extracting specific entity names from TCM records. This task has significant implications for doctors and researchers, as it enables the automated identification of relevant TCM terms, ultimately enhancing research efficiency and accuracy. However, the current Bidirectional Encoder Representations from Transformers-Long Short-Term Memory-Conditional Random Fields (BERT-LSTM-CRF) model for TCM NER is constrained by its traditional stacked structure, limiting its capacity to fully harness the advantages of the Bidirectional Encoder Representations from Transformers (BERT) and long short-term memory (LSTM) models. Through comparative experiments, we also observed that straightforward superimposition of models actually degrades recognition results. To optimize the structure of the traditional BERT-BiLSTM-CRF model and obtain more effective text representations, we propose the Dyn-Att Net model, which introduces dynamic attention and a parallel structure. By integrating the BERT and LSTM models with a dynamic attention mechanism, our model effectively captures semantic, contextual, and sequential relations within text sequences, resulting in high accuracy. To validate the effectiveness of our model, we compared it with nine other models on a TCM dataset, namely the publicly available PaddlePaddle dataset. Our BERT-based Dyn-Att Net model outperforms the other models, achieving an F1 score of 81.91%, accuracy of 92.06%, precision of 80.26%, and recall of 83.76%. Furthermore, its robust generalization capability is substantiated through validation on the APTNER, MSRA, and EduNER datasets. Overall, the Dyn-Att Net model not only enhances NER accuracy within the realm of traditional Chinese medicine, but also shows considerable potential for cross-domain generalization.
Moreover, the Dyn-Att Net model's parallel architecture enables efficient computation, reducing the time required for NER tasks.
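To make the parallel-fusion idea concrete, the following is a minimal sketch of how a dynamic attention mechanism could combine the outputs of two parallel encoders (BERT and BiLSTM) into one token representation. The gating formulation, function names, and shapes here are illustrative assumptions — the abstract does not specify the exact form of the paper's dynamic attention — and random arrays stand in for real encoder outputs.

```python
import numpy as np

def dynamic_attention_fuse(h_bert, h_lstm, W, b):
    """Fuse BERT and BiLSTM token representations with a learned gate.

    h_bert, h_lstm: (seq_len, hidden) outputs of the two parallel encoders.
    W: (2 * hidden, hidden) gate weights; b: (hidden,) gate bias.
    Returns a (seq_len, hidden) fused representation, e.g. for a CRF layer.
    """
    concat = np.concatenate([h_bert, h_lstm], axis=-1)  # (seq_len, 2*hidden)
    gate = 1.0 / (1.0 + np.exp(-(concat @ W + b)))      # sigmoid gate in (0, 1)
    # Per-dimension convex mixture: gate decides how much each encoder
    # contributes for every token and feature.
    return gate * h_bert + (1.0 - gate) * h_lstm

# Toy usage: 5 tokens, hidden size 8, random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
h_bert = rng.normal(size=(5, 8))
h_lstm = rng.normal(size=(5, 8))
W = rng.normal(scale=0.1, size=(16, 8))
b = np.zeros(8)
fused = dynamic_attention_fuse(h_bert, h_lstm, W, b)
print(fused.shape)  # (5, 8)
```

Because the two encoders run side by side rather than stacked, their forward passes can be computed in parallel, which is consistent with the efficiency claim above; the gate then adapts the mixture per token instead of fixing one encoder's output as the sole input to the next layer.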