Integrating deep learning architectures for enhanced biomedical relation extraction: a pipeline approach.

Affiliations

Informatics Programs, University of Illinois Urbana-Champaign, 614 E Daniel Street, Champaign, IL 61820, United States.

School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel Street, Champaign, IL 61820, United States.

Publication information

Database (Oxford). 2024 Aug 28;2024. doi: 10.1093/database/baae079.

Abstract

Biomedical relation extraction from scientific publications is a key task in biomedical natural language processing (NLP): it can facilitate the creation of large knowledge bases, enable more efficient knowledge discovery, and accelerate evidence synthesis. In this paper, building upon our previous effort in the BioCreative VIII BioRED Track, we propose an enhanced end-to-end pipeline approach for biomedical relation extraction (RE) and novelty detection (ND) that leverages existing datasets and integrates state-of-the-art deep learning methods. Our pipeline consists of four tasks performed sequentially: named entity recognition (NER), entity linking (EL), RE, and ND. We trained models using the BioRED benchmark corpus, which was the basis of the shared task. We explored several methods for each task, as well as combinations thereof. For NER, we compared a BERT-based sequence labeling model that uses the BIO scheme with a span classification model. For EL, we trained a convolutional neural network model for diseases and chemicals and used an existing tool, PubTator 3.0, to map the other entity types. For RE and ND, we adapted the BERT-based, sentence-bound PURE model to bidirectional and document-level extraction. We also performed extensive hyperparameter tuning to improve model performance. We obtained our best performance using BERT-based models for NER, RE, and ND, and the hybrid approach for EL. Our enhanced and optimized pipeline improved substantially on our shared task submission: NER 93.53 (+3.09), EL 83.87 (+9.73), RE 46.18 (+15.67), and ND 38.86 (+14.9). While the performance of the NER and EL models is reasonably high, the RE and ND tasks remain challenging at the document level. Further enhancements to the dataset could enable more accurate and useful models for practical use. We provide our models and code at https://github.com/janinaj/e2eBioMedRE/. Database URL: https://github.com/janinaj/e2eBioMedRE/.
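
To make the BIO scheme mentioned for the NER step concrete, the following is a minimal sketch of how a sequence of BIO tags can be decoded into entity spans. It is not taken from the authors' repository: the function name `bio_to_spans`, the example sentence, and the shortened labels `Gene` and `Disease` are illustrative assumptions, as is the lenient handling of a stray `I-` tag.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode BIO tags (B-X, I-X, O) into (start, end, type) spans."""
    spans: List[Span] = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # An I- tag of the same type as the open span simply extends it.
        if tag.startswith("I-") and current == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if current is not None:
            spans.append((start, i, current))
            start, current = None, None
        # B- opens a new span. A stray I- (no matching open span) is
        # treated leniently as a span start; this is an assumption.
        if tag.startswith(("B-", "I-")):
            start, current = i, tag[2:]
    # Close a span that runs to the end of the sequence.
    if current is not None:
        spans.append((start, len(tags), current))
    return spans


# Hypothetical example sentence and labels, for illustration only.
tokens = ["Mutations", "in", "BRCA1", "increase", "breast", "cancer", "risk"]
tags = ["O", "O", "B-Gene", "O", "B-Disease", "I-Disease", "O"]

spans = bio_to_spans(tags)
print([(" ".join(tokens[s:e]), t) for s, e, t in spans])
# prints [('BRCA1', 'Gene'), ('breast cancer', 'Disease')]
```

A span classification model, by contrast, scores candidate (start, end) pairs directly, so it needs no decoding step of this kind.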

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/73fc/11352595/9c6b60ae2304/baae079f1.jpg
