Technol Health Care. 2023;31(S1):111-121. doi: 10.3233/THC-236011.
With the exponential increase in the volume of biomedical literature, text mining tasks are becoming increasingly important in the medical domain. Named entity recognition is a primary task in text mining and a prerequisite for, and critical part of, building medical domain knowledge graphs, medical question answering systems, and medical text classification systems.
This study aims to recognize biomedical entities effectively by fusing multi-feature embeddings. Multiple features provide more comprehensive information, so that better predictions can be obtained.
First, three kinds of features are generated at the word representation layer: deep contextual word-level features, local character-level features, and part-of-speech features. The resulting word representation vectors are then fed into a BiLSTM to capture dependency information. Finally, a CRF layer learns the features of the state sequence to obtain the globally optimal tag sequence.
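The final decoding step can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say how the three features are fused, so simple concatenation is assumed here, the BiLSTM is left out (its per-token tag scores are passed in as `emissions`), and the names `fuse_features` and `viterbi_decode` are hypothetical. The Viterbi routine shows how a CRF layer picks the highest-scoring tag sequence as a whole instead of choosing the best tag for each token independently.

```python
from typing import List, Sequence, Tuple


def fuse_features(word_vec: Sequence[float],
                  char_vec: Sequence[float],
                  pos_vec: Sequence[float]) -> List[float]:
    """Fuse three per-token feature vectors by concatenation (an assumption)."""
    return list(word_vec) + list(char_vec) + list(pos_vec)


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> Tuple[float, List[int]]:
    """Return the best total score and the tag path that achieves it.

    emissions[t][j]   : score of tag j at position t (e.g. from a BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j
    """
    if not emissions:
        return 0.0, []
    num_tags = len(emissions[0])
    scores = list(emissions[0])          # best score of any path ending in each tag
    backpointers: List[List[int]] = []   # best previous tag for each (position, tag)
    for t in range(1, len(emissions)):
        new_scores, pointers = [], []
        for j in range(num_tags):
            best_prev = max(range(num_tags),
                            key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_prev] + transitions[best_prev][j]
                              + emissions[t][j])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag to recover the path.
    last = max(range(num_tags), key=lambda j: scores[j])
    best_score = scores[last]
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return best_score, path


# Toy example with tags 0 = O, 1 = B, 2 = I; the O -> I transition is penalised.
emissions = [[2.0, 1.5, 0.0], [0.0, 1.0, 2.0], [1.0, 0.0, 0.5]]
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
score, path = viterbi_decode(emissions, transitions)
print(score, path)
```

In this toy example, picking the best tag per token would give O, I, O, which includes the penalised O -> I transition. The decoder instead returns B, I, O (path `[1, 2, 0]`, score 4.5), the kind of sequence-level consistency the CRF layer is meant to provide.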
Experimental results show that the model outperformed other state-of-the-art methods in overall performance on six of eight datasets covering four biomedical entity types.
The proposed method has a positive effect on prediction results. By fusing multi-feature embeddings, it enhances the semantic information and takes the relevant factors of named entity recognition into account more comprehensively.