TurkuNLP Group, University of Turku, Turku, Finland.
BMC Bioinformatics. 2020 Dec 29;21(Suppl 23):580. doi: 10.1186/s12859-020-03905-8.
Background: Syntactic analysis, or parsing, is a key task in natural language processing and a required component for many text mining approaches. In recent years, Universal Dependencies (UD) has emerged as the leading formalism for dependency parsing. While a number of recent shared tasks centering on UD have substantially advanced the state of the art in multilingual parsing, there has been little study of parsing texts from specialized domains such as biomedicine.
Methods: We explore the application of state-of-the-art neural dependency parsing methods to biomedical text using the recently introduced CRAFT-SA shared task dataset. The CRAFT-SA task broadly follows the UD representation and recent UD task conventions, allowing us to fine-tune the UD-compatible Turku Neural Parser and UDify parsers to the task. We further evaluate the effect of transfer learning using a broad selection of BERT models, including several models pre-trained specifically for biomedical text processing.
Results: We find that recently introduced neural parsing technology is capable of generating highly accurate analyses of biomedical text, substantially improving on the best performance reported in the original CRAFT-SA shared task. We also find that initialization using a deep transfer learning model pre-trained on in-domain texts is key to maximizing the performance of the parsing methods.
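Parser performance on UD-style data such as CRAFT-SA is conventionally reported with the labeled and unlabeled attachment scores (LAS/UAS) over CoNLL-U analyses. The following is a minimal illustrative sketch of that evaluation; the biomedical example sentence and the single predicted error are invented for demonstration and are not taken from the CRAFT corpus.

```python
# Illustrative sketch: LAS/UAS computation over gold vs. predicted CoNLL-U
# analyses. LAS counts tokens whose head AND dependency label are correct;
# UAS counts tokens whose head is correct regardless of label.

def read_conllu(text):
    """Extract (HEAD, DEPREL) per token, skipping comments,
    multiword-token ranges, and empty nodes, which are not scored."""
    rows = []
    for line in text.strip().splitlines():
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue
        rows.append((cols[6], cols[7]))  # HEAD and DEPREL columns
    return rows

def las_uas(gold, pred):
    assert len(gold) == len(pred), "token counts must match"
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return las, uas

# Invented example: "Protein kinases phosphorylate substrates."
gold = """\
1\tProtein\tprotein\tNOUN\t_\t_\t2\tcompound\t_\t_
2\tkinases\tkinase\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tphosphorylate\tphosphorylate\tVERB\t_\t_\t0\troot\t_\t_
4\tsubstrates\tsubstrate\tNOUN\t_\t_\t3\tobj\t_\t_
"""
# Simulate a prediction with one mislabeled (but correctly attached) edge.
pred = gold.replace("2\tcompound", "2\tamod")

las, uas = las_uas(read_conllu(gold), read_conllu(pred))
print(f"LAS={las:.2f} UAS={uas:.2f}")  # LAS=0.75 UAS=1.00
```

A single labeling error lowers LAS but not UAS, which is why both scores are typically reported together in UD evaluations.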