National Science Foundation Center for Big Learning, University of Florida, Gainesville, FL 32611, USA.
Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA.
Bioinformatics. 2018 May 1;34(9):1547-1554. doi: 10.1093/bioinformatics/btx815.
The best-performing named entity recognition (NER) methods for biomedical literature rely on hand-crafted features or task-specific rules, which are costly to produce and difficult to generalize to other corpora. In non-biomedical NER tasks, end-to-end neural networks achieve state-of-the-art performance without hand-crafted features or task-specific knowledge. In the biomedical domain, however, the same architectures do not yield performance competitive with conventional machine learning models.
We propose a novel end-to-end deep learning approach for biomedical NER that uses a convolutional neural network (CNN) to exploit local context derived from n-gram character and word embeddings. We call this approach GRAM-CNN. To label a word automatically, the method uses the local information around that word. GRAM-CNN therefore requires no task-specific knowledge or feature engineering and can, in theory, be applied to a wide range of existing NER problems. We evaluated GRAM-CNN on three well-known biomedical datasets containing different BioNER entities. It obtained an F1-score of 87.26% on the Biocreative II dataset, 87.26% on the NCBI dataset and 72.57% on the JNLPBA dataset. These results put GRAM-CNN in the lead among biological NER methods. To the best of our knowledge, we are the first to apply CNN-based structures to BioNER problems.
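The idea of reading local context through n-gram windows can be illustrated with a minimal sketch. This is not the GRAM-CNN implementation (see the repository for that). It is a pure-Python illustration under stated assumptions: the window widths (1, 3, 5), the number of filters, the random weights, and the function names `ngram_conv`, `gram_features` and `random_filter_bank` are all hypothetical choices made for this example. Each token gets one feature vector per window width, and the vectors are concatenated so that every token carries information from its neighbours.

```python
import random


def ngram_conv(embeddings, filters, width):
    """Apply filters of one n-gram width at every token position.

    embeddings: list of T vectors, each of length d.
    filters: list of flat weight vectors, each of length width * d.
    Returns T feature vectors, each of length len(filters).
    """
    dim = len(embeddings[0])
    left = (width - 1) // 2          # zero padding before the sentence
    right = width - 1 - left         # zero padding after the sentence
    zero = [0.0] * dim
    padded = [zero] * left + list(embeddings) + [zero] * right
    features = []
    for i in range(len(embeddings)):
        # Flatten the window of `width` neighbouring embeddings.
        window = [x for vec in padded[i:i + width] for x in vec]
        # Dot product with each filter, followed by ReLU.
        row = [max(0.0, sum(w * x for w, x in zip(f, window))) for f in filters]
        features.append(row)
    return features


def gram_features(embeddings, filter_bank):
    """Concatenate per-token features from every width in filter_bank."""
    per_width = [
        ngram_conv(embeddings, filters, width)
        for width, filters in sorted(filter_bank.items())
    ]
    return [sum(rows, []) for rows in zip(*per_width)]


def random_filter_bank(dim, widths=(1, 3, 5), n_filters=4, seed=0):
    """Hypothetical untrained filters; a real model would learn these."""
    rng = random.Random(seed)
    return {
        w: [[rng.uniform(-1, 1) for _ in range(w * dim)] for _ in range(n_filters)]
        for w in widths
    }


# Toy input: 6 tokens with 8-dimensional embeddings (random stand-ins).
rng = random.Random(1)
sentence = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(6)]
features = gram_features(sentence, random_filter_bank(dim=8))
print(len(features), len(features[0]))  # 6 12
```

In a full tagger, the per-token vectors would feed a labelling layer that predicts entity tags. The sketch stops before that step because the abstract does not describe it.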
The GRAM-CNN source code, datasets and pre-trained model are available online at: https://github.com/valdersoul/GRAM-CNN.
Contact: andyli@ece.ufl.edu or aconesa@ufl.edu.
Supplementary data are available at Bioinformatics online.