BioBBC: a multi-feature model that enhances the detection of biomedical entities.

Affiliations

Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.

Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.

Publication information

Sci Rep. 2024 Apr 2;14(1):7697. doi: 10.1038/s41598-024-58334-x.

Abstract

The rapid increase in biomedical publications necessitates efficient systems that automatically perform Biomedical Named Entity Recognition (BioNER) on unstructured text. However, accurately detecting biomedical entities is challenging because of the complexity of their names and the frequent use of abbreviations. In this paper, we propose BioBBC, a deep learning (DL) model that uses multi-feature embeddings and is built on the BERT-BiLSTM-CRF architecture to address the BioNER task. BioBBC consists of three main layers: an embedding layer, a Bidirectional Long Short-Term Memory (BiLSTM) layer, and a Conditional Random Field (CRF) layer. BioBBC takes sentences from the biomedical domain as input and identifies the biomedical entities mentioned within the text. The embedding layer generates enriched contextual representation vectors of the input by combining four types of embeddings: part-of-speech (POS) tag embeddings, character-level embeddings, BERT embeddings, and data-specific embeddings. The BiLSTM layer produces additional syntactic and semantic feature representations. Finally, the CRF layer identifies the best possible tag sequence for the input sentence. The model is designed and optimized to detect different types of biomedical entities. In experiments on six benchmark BioNER datasets, our model outperformed state-of-the-art (SOTA) models with significant improvements.
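
To illustrate what "identifies the best possible tag sequence" means for a CRF layer, the following is a minimal sketch of Viterbi decoding in plain Python. It is not the authors' implementation. The BIO tag set, the token sentence, and all scores are made-up assumptions for illustration; in a model such as BioBBC the per-token (emission) scores would come from the BiLSTM layer and the tag-to-tag (transition) scores would be learned by the CRF.

```python
from typing import Dict, List


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[str, Dict[str, float]],
                   tags: List[str]) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence."""
    if not emissions:
        return []
    # best[t][tag] = score of the best path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t-1][tag] = previous tag on that best path
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical example: tokens "the BRCA1 gene" with BIO tags.
tags = ["O", "B", "I"]
emissions = [
    {"O": 2.0, "B": 0.1, "I": 0.0},  # "the"
    {"O": 0.2, "B": 1.0, "I": 1.2},  # "BRCA1": "I" scores highest in isolation
    {"O": 1.0, "B": 0.0, "I": 0.8},  # "gene"
]
# All transitions are neutral except O -> I, which is invalid in BIO tagging.
transitions = {p: {c: 0.0 for c in tags} for p in tags}
transitions["O"]["I"] = -10.0

greedy = [max(tags, key=lambda tag: e[tag]) for e in emissions]
decoded = viterbi_decode(emissions, transitions, tags)
print(greedy)   # ['O', 'I', 'O']  (invalid: "I" directly after "O")
print(decoded)  # ['O', 'B', 'O']
```

Picking the best tag for each token independently yields an entity that starts with `I`, whereas scoring whole sequences lets the transition penalty steer the decoder to a valid `B` tag. This is the general motivation for placing a CRF on top of a BiLSTM in sequence-labeling models.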

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ef54/10987643/20d135464a86/41598_2024_58334_Fig1_HTML.jpg
