
Chemical identification and indexing in PubMed full-text articles using deep learning and heuristics.

Affiliations

Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal.

Department of Information and Communications Technologies, University of A Coruña, A Coruña, Spain.

Publication information

Database (Oxford). 2022 Jul 1;2022. doi: 10.1093/database/baac047.

Abstract

The identification of chemicals in articles has attracted great interest in the biomedical scientific community, given its importance in drug development research. Most previous research has focused on PubMed abstracts, and further investigation using full-text documents is required because these contain additional valuable information that must be explored. The manual expert task of indexing these articles with Medical Subject Headings (MeSH) terms later helps researchers find the most relevant publications for their ongoing work. The BioCreative VII NLM-Chem track fostered the development of systems for chemical identification and indexing in PubMed full-text articles. Chemical identification consisted of identifying the chemical mentions and linking them to unique MeSH identifiers. This manuscript describes the system we submitted and the post-challenge improvements we made. We propose a three-stage pipeline that separately performs chemical mention detection, entity normalization and indexing. For chemical identification, we adopted a deep-learning solution that uses PubMedBERT contextualized embeddings followed by a multilayer perceptron and a conditional random field tagging layer. For normalization, we use sieve-based dictionary filtering followed by a deep-learning similarity search strategy. Finally, for indexing, we developed rules for identifying the most relevant MeSH codes for each article. During the challenge, our system obtained the best official results in the normalization and indexing tasks despite its lower performance in the chemical mention recognition task. In a post-contest phase, we boosted our results by improving our named entity recognition model with additional techniques. The final system achieved 0.8731, 0.8275 and 0.4849 in the chemical identification, normalization and indexing tasks, respectively. The code to reproduce our experiments and run the pipeline is publicly available.
Database URL: https://github.com/bioinformatics-ua/biocreativeVII_track2
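
To make the normalization stage more concrete, the following is a minimal sketch of a sieve-style lookup with a similarity fallback. It is not the authors' implementation: the dictionary entries and the `TOY:` identifiers are hypothetical placeholders rather than real MeSH codes, the function names are invented for illustration, and a character trigram Jaccard score stands in for the deep-learning similarity search mentioned in the abstract.

```python
from typing import Dict, Optional, Set

# Hypothetical toy dictionary: surface form -> identifier.
# The "TOY:" identifiers are placeholders, NOT real MeSH codes.
TOY_DICTIONARY: Dict[str, str] = {
    "Aspirin": "TOY:0001",
    "acetylsalicylic acid": "TOY:0001",
    "Glucose": "TOY:0002",
    "ethanol": "TOY:0003",
}


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def normalize(
    mention: str,
    dictionary: Dict[str, str] = TOY_DICTIONARY,
    threshold: float = 0.5,
) -> Optional[str]:
    """Map a chemical mention to an identifier, or None if nothing matches."""
    # Sieve 1: exact string match.
    if mention in dictionary:
        return dictionary[mention]

    # Sieve 2: case-insensitive match after trimming whitespace.
    lowered = {name.lower(): ident for name, ident in dictionary.items()}
    key = mention.strip().lower()
    if key in lowered:
        return lowered[key]

    # Fallback: nearest dictionary name by trigram Jaccard similarity
    # (an assumption standing in for an embedding-based similarity search).
    grams = char_ngrams(key)
    best_id, best_score = None, 0.0
    for name, ident in dictionary.items():
        score = jaccard(grams, char_ngrams(name))
        if score > best_score:
            best_id, best_score = ident, score
    return best_id if best_score >= threshold else None
```

Ordering the sieves from strictest to loosest means cheap, high-precision matches are resolved first, and only the remaining mentions reach the approximate similarity step, where the threshold controls how readily a near match is accepted.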


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ace7/9248917/c3214e4afa44/baac047f1.jpg
