Cross-type biomedical named entity recognition with deep multi-task learning.
Author information
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
Department of Computer Science, University of Southern California, Los Angeles, CA, USA.
Publication information
Bioinformatics. 2019 May 15;35(10):1745-1752. doi: 10.1093/bioinformatics/bty869.
MOTIVATION
State-of-the-art biomedical named entity recognition (BioNER) systems often require handcrafted features specific to each entity type, such as genes, chemicals and diseases. Although recent studies explored using neural network models for BioNER to free experts from manual feature engineering, the performance remains limited by the available training data for each entity type.
RESULTS
We propose a multi-task learning framework for BioNER to collectively use the training data of different types of entities and improve the performance on each of them. In experiments on 15 benchmark BioNER datasets, our multi-task model achieves substantially better performance compared with state-of-the-art BioNER systems and baseline neural sequence labeling models. Further analysis shows that the large performance gains come from sharing character- and word-level information among relevant biomedical entities across differently labeled corpora.
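As a rough illustration of the parameter-sharing idea described above, the following minimal sketch trains one shared representation over word and character-trigram features together with a separate output layer per corpus. It is not the authors' model: the toy corpora, the `MultiTaskTagger` class, the `features` helper, the averaged-embedding encoder, and all hyperparameters are assumptions made for illustration, and tokens are classified independently rather than with a sequence model.

```python
import math
import random

DIM = 8  # size of the shared representation (arbitrary for this sketch)

# Hypothetical toy corpora: each one annotates a single entity type.
DATASETS = {
    "gene": [("BRCA1", "B-Gene"), ("TP53", "B-Gene"), ("patients", "O"), ("with", "O")],
    "chemical": [("aspirin", "B-Chemical"), ("ibuprofen", "B-Chemical"),
                 ("patients", "O"), ("treated", "O")],
    "disease": [("melanoma", "B-Disease"), ("carcinoma", "B-Disease"),
                ("patients", "O"), ("with", "O")],
}


def features(word):
    """Word-level and character-trigram features for one token."""
    w = word.lower()
    return ["w=" + w] + ["c=" + w[i:i + 3] for i in range(max(1, len(w) - 2))]


class MultiTaskTagger:
    def __init__(self, task_labels, seed=0):
        self.rng = random.Random(seed)
        self.shared = {}  # feature -> vector, shared by every task
        # One output layer per corpus: label -> weight vector.
        self.heads = {task: {label: [0.0] * DIM for label in labels}
                      for task, labels in task_labels.items()}

    def _embed(self, feat):
        if feat not in self.shared:
            self.shared[feat] = [self.rng.uniform(-0.1, 0.1) for _ in range(DIM)]
        return self.shared[feat]

    def encode(self, word):
        """Average the shared feature vectors of a token."""
        feats = features(word)
        vecs = [self._embed(f) for f in feats]
        return feats, [sum(v[d] for v in vecs) / len(vecs) for d in range(DIM)]

    def probs(self, task, h):
        """Softmax over the labels of one task-specific output layer."""
        scores = {lab: sum(w * x for w, x in zip(vec, h))
                  for lab, vec in self.heads[task].items()}
        top = max(scores.values())
        exps = {lab: math.exp(s - top) for lab, s in scores.items()}
        total = sum(exps.values())
        return {lab: e / total for lab, e in exps.items()}

    def update(self, task, word, gold, lr=0.5):
        """One SGD step on cross-entropy: updates this task's head and the shared vectors."""
        feats, h = self.encode(word)
        p = self.probs(task, h)
        grad_h = [0.0] * DIM
        for lab, vec in self.heads[task].items():
            err = p[lab] - (1.0 if lab == gold else 0.0)
            for d in range(DIM):
                grad_h[d] += err * vec[d]   # uses the weight before its update
                vec[d] -= lr * err * h[d]
        for f in feats:
            emb = self.shared[f]
            for d in range(DIM):
                emb[d] -= lr * grad_h[d] / len(feats)

    def predict(self, task, word):
        _, h = self.encode(word)
        p = self.probs(task, h)
        return max(p, key=p.get)


def train(model, datasets, steps=2000, seed=0):
    """Sample one corpus per step, so every corpus trains the shared vectors."""
    rng = random.Random(seed)
    tasks = sorted(datasets)
    for _ in range(steps):
        task = rng.choice(tasks)
        word, label = rng.choice(datasets[task])
        model.update(task, word, label)


labels = {task: sorted({lab for _, lab in rows}) for task, rows in DATASETS.items()}
model = MultiTaskTagger(labels)
train(model, DATASETS)
print(model.predict("gene", "BRCA1"), model.predict("disease", "melanoma"))
```

In this sketch, every corpus contributes gradient updates to the same feature vectors, while each corpus keeps its own label set and output weights, so differently labeled corpora never need a merged tag scheme.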
AVAILABILITY AND IMPLEMENTATION
Our source code is available at https://github.com/yuzhimanhua/lm-lstm-crf.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.