IEEE/ACM Trans Comput Biol Bioinform. 2020 May-Jun;17(3):841-846. doi: 10.1109/TCBB.2018.2868346. Epub 2018 Sep 3.
Biomedical named entity recognition (Bio-NER) is an important preliminary step for many biomedical text mining tasks. Current mainstream NER methods are based on neural networks, which avoid the complex hand-designed features derived from various linguistic analyses. However, these methods ignore some potential sentence-level semantic information as well as general semantic and syntactic features. We therefore propose a novel Long Short-Term Memory (LSTM) network model for Bio-NER that integrates a language model and a sentence-level reading control gate (LS-BLSTM-CRF). In our model, a sentence-level reading control gate (SC) is inserted into the network to incorporate the implicit meaning of the entire sentence, and a language model is integrated into the model to learn richer potential features. In addition, character-level embeddings are used as input to handle out-of-vocabulary words. Experiments on the BioCreative II GM corpus show that our method achieves an F-score of 89.94 percent, which outperforms all state-of-the-art systems and is 1.33 percent higher than the best-performing neural network.
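For readers unfamiliar with the reported metric, the following is a minimal sketch of an entity-level F-score, the harmonic mean of precision and recall. It assumes exact-match scoring over spans decoded from single-type BIO tags; the function names `bio_to_spans` and `f_score` are illustrative, and the official scoring of the BioCreative II GM corpus may differ in detail.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (single entity type)."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # A new entity begins here; close any entity still open.
            if start is not None:
                spans.add((start, i))
            start = i
        elif tag == "O" and start is not None:
            # An "O" tag ends the entity that was open.
            spans.add((start, i))
            start = None
    if start is not None:
        # The sequence ended while an entity was still open.
        spans.add((start, len(tags)))
    return spans


def f_score(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) under exact span matching (an assumption)."""
    gold_spans = bio_to_spans(gold)
    pred_spans = bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: one of two gold entities is predicted with exact boundaries.
gold = ["B", "I", "O", "B", "O"]
pred = ["B", "I", "O", "O", "B"]
print(f_score(gold, pred))  # (0.5, 0.5, 0.5)
```

Under this definition a predicted entity counts as correct only when both of its boundaries match the gold annotation, so a system is penalized twice for a boundary error: once as a false positive and once as a missed entity.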