Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Blegdamsvej 3, Copenhagen, 2200, Denmark.
TurkuNLP Group, Department of Computing, University of Turku, Turku, 20014, Finland.
Bioinformatics. 2024 Sep 1;40(Suppl 2):ii45-ii52. doi: 10.1093/bioinformatics/btae402.
Dictionary-based named entity recognition (NER) allows terms to be detected in a corpus and normalized to biomedical databases and ontologies. However, adaptation to different entity types requires new high-quality dictionaries and associated lists of blocked names for each type. The latter are so far created by identifying cases that cause many false positives through manual inspection of individual names, a process that scales poorly.
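As a minimal illustration of how a block list interacts with dictionary matching, the sketch below tags names from a toy dictionary and skips any name on the block list. The function `tag`, the identifiers, and the example sentence are hypothetical and are not taken from the authors' tagger.

```python
import re

def tag(text, dictionary, blocked):
    """Return sorted (start, end, matched_text, identifier) tuples for
    dictionary names found in `text`, skipping names in `blocked`."""
    hits = []
    for name, identifier in dictionary.items():
        if name.lower() in blocked:
            continue  # blocked names are never tagged
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), m.group(0), identifier))
    return sorted(hits)

# Toy dictionary; the identifiers are placeholders, not real database IDs.
dictionary = {"TP53": "GENE:TP53", "IMPACT": "GENE:IMPACT"}
text = "TP53 mutations have a major impact on prognosis."

unblocked = tag(text, dictionary, blocked=set())
filtered = tag(text, dictionary, blocked={"impact"})
```

Without the block list, the common English word "impact" is tagged as a gene; with it, only "TP53" remains.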
In this work, we aim to improve block lists by automatically identifying names to block, based on the context in which they appear. By comparing results of three well-established biomedical NER methods, we generated a dataset of over 12.5 million text spans where the methods agree on the boundaries and type of entity tagged. These were used to generate positive and negative examples of contexts for four entity types (genes, diseases, species, and chemicals), which were used to train a Transformer-based model (BioBERT) to perform entity type classification. Application of the best model (F1-score = 96.7%) allowed us to generate a list of problematic names that should be blocked. Introducing this into our system doubled the size of the previous list of corpus-wide blocked names. In addition, we generated a document-specific list that allows ambiguous names to be blocked in specific documents. These changes boosted text mining precision by ∼5.5% on average, and over 8.5% for chemical and 7.5% for gene names, positively affecting several biological databases utilizing this NER system, like the STRING database, with only a minor drop in recall (0.6%).
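The two ideas above, keeping only spans on which all methods agree and blocking names whose contexts rarely support the dictionary type, can be sketched roughly as follows. This is an assumption-laden toy, not the authors' pipeline: the function names, the tuple layout, and the thresholds `min_count` and `max_support` are all hypothetical, and the classifier predictions are supplied as plain data rather than produced by BioBERT.

```python
from collections import Counter

def agreed_spans(*outputs):
    """Keep (doc_id, start, end, entity_type) annotations that every
    method produced with identical boundaries and type."""
    common = set(outputs[0])
    for out in outputs[1:]:
        common &= set(out)
    return common

def corpus_block_list(predictions, dictionary_type,
                      min_count=3, max_support=0.2):
    """predictions: iterable of (name, predicted_type), one per occurrence.
    Block a name when it occurs at least `min_count` times and the
    predicted type matches the dictionary type in at most `max_support`
    of its occurrences. Both thresholds are illustrative assumptions."""
    totals, agree = Counter(), Counter()
    for name, predicted in predictions:
        totals[name] += 1
        if predicted == dictionary_type[name]:
            agree[name] += 1
    return {n for n, t in totals.items()
            if t >= min_count and agree[n] / t <= max_support}

# Toy outputs from three hypothetical NER methods.
a = [("d1", 0, 4, "gene"), ("d1", 10, 16, "disease")]
b = [("d1", 0, 4, "gene"), ("d1", 10, 16, "chemical")]
c = [("d1", 0, 4, "gene")]
consensus = agreed_spans(a, b, c)

# Toy per-occurrence predictions for two dictionary names.
preds = [("impact", "other")] * 9 + [("impact", "gene")] + [("TP53", "gene")] * 5
blocked = corpus_block_list(preds, {"impact": "gene", "TP53": "gene"})
```

A document-specific list could follow the same pattern by counting per `(doc_id, name)` pair instead of per name, though how the authors define it is not stated in this abstract.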
All resources are available through Zenodo https://doi.org/10.5281/zenodo.11243139 and GitHub https://doi.org/10.5281/zenodo.10289360.