College of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
Beijing Institute of Health Administration and Medical Information, Beijing 100850, China.
Bioinformatics. 2018 Apr 15;34(8):1381-1388. doi: 10.1093/bioinformatics/btx761.
In biomedical research, chemicals are an important class of entities, and chemical named entity recognition (NER) is an important task in biomedical information extraction. However, most popular chemical NER methods are based on traditional machine learning, and their performance depends heavily on feature engineering. Moreover, these methods operate at the sentence level and therefore suffer from the tagging inconsistency problem, in which instances of the same token receive different tags in different sentences of one document.
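To make the tagging inconsistency problem concrete, the following minimal sketch flags tokens that receive more than one label within a single document. It is an illustration only and is not taken from the paper's code; the function name `find_inconsistent_tokens`, the BIO-style tags, and the toy document are all assumptions.

```python
from collections import defaultdict


def find_inconsistent_tokens(tagged_sentences):
    """Return tokens that receive more than one label within one document.

    tagged_sentences: list of sentences, each a list of (token, tag) pairs.
    Hypothetical helper for illustration; not part of the authors' release.
    """
    labels_by_token = defaultdict(set)
    for sentence in tagged_sentences:
        for token, tag in sentence:
            # Collapse B-/I- prefixes so only the entity type is compared.
            label = tag.split("-", 1)[1] if "-" in tag else tag
            labels_by_token[token.lower()].add(label)
    return {tok: labels for tok, labels in labels_by_token.items() if len(labels) > 1}


# Toy document (assumed data): a sentence-level tagger labels "aspirin"
# as a chemical in one sentence and misses it in another.
doc = [
    [("Aspirin", "B-CHEM"), ("reduces", "O"), ("fever", "O")],
    [("Patients", "O"), ("took", "O"), ("aspirin", "O")],
]
inconsistent = find_inconsistent_tokens(doc)
```

Here `inconsistent` contains only the token "aspirin", which was labelled both as a chemical and as a non-entity.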
In this paper, we propose a neural network approach to document-level chemical NER: an attention-based bidirectional Long Short-Term Memory network with a conditional random field layer (Att-BiLSTM-CRF). The approach leverages document-level global information obtained by an attention mechanism to enforce tagging consistency across multiple instances of the same token in a document. With little feature engineering, it achieves better performance than other state-of-the-art methods on the BioCreative IV chemical compound and drug name recognition (CHEMDNER) corpus and the BioCreative V chemical-disease relation (CDR) task corpus (F-scores of 91.14% and 92.57%, respectively).
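The idea of gathering document-level information through attention can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the abstract does not specify the scoring function, so dot-product scoring with a softmax is assumed, and the function name `document_attention` and the toy hidden states are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def document_attention(query, doc_states):
    """Attend from one token state over all token states in a document.

    Assumes dot-product scoring followed by a softmax; returns the
    attention weights and the weighted-average context vector.
    """
    scores = [dot(query, h) for h in doc_states]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    context = [sum(w * h[d] for w, h in zip(weights, doc_states)) for d in range(dim)]
    return weights, context


# Toy hidden states (assumed data): positions 0 and 2 stand for two
# instances of the same token, position 1 for an unrelated token.
doc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
weights, context = document_attention(doc_states[0], doc_states)
```

In this toy case the two matching positions receive equal weight and more weight than the unrelated one, so the context vector is dominated by the other instances of the same token. How such a context vector is combined with the local BiLSTM state before the CRF layer is not described in the abstract and is left out of the sketch.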
Data and code are available at https://github.com/lingluodlut/Att-ChemdNER.
Contact: yangzh@dlut.edu.cn or wangleibihami@gmail.com.
Supplementary data are available at Bioinformatics online.