A method for named entity normalization in biomedical articles: application to diseases and plants.

Authors

Cho Hyejin, Choi Wonjun, Lee Hyunju

Affiliation

School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, 123 Chemdangwagi-ro, Buk-gu, Gwangju, Republic of Korea.

Publication

BMC Bioinformatics. 2017 Oct 13;18(1):451. doi: 10.1186/s12859-017-1857-8.

DOI:10.1186/s12859-017-1857-8
PMID:29029598
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5640957/

Abstract

BACKGROUND

In biomedical articles, a named entity recognition (NER) technique that identifies entity names from texts is an important element for extracting biological knowledge from articles. After NER is applied to articles, the next step is to normalize the identified names into standard concepts (e.g., disease names are mapped to the National Library of Medicine's Medical Subject Headings disease terms). In biomedical articles, many entity normalization methods rely on domain-specific dictionaries for resolving synonyms and abbreviations. However, the dictionaries are not comprehensive except for some entities such as genes. In recent years, biomedical articles have accumulated rapidly, and neural network-based algorithms that incorporate a large amount of unlabeled data have shown considerable success in several natural language processing problems.
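
To make the dictionary-based baseline described above concrete, here is a minimal sketch of normalization by dictionary lookup. It is an illustration only, not the authors' system: the dictionary entries, the `CONCEPT:` identifiers, and the function names are all hypothetical placeholders rather than real MeSH data. It also shows the limitation the abstract points to: a mention that is missing from the dictionary cannot be normalized.

```python
import re

# Hypothetical toy dictionary: surface form -> concept identifier.
# The identifiers are placeholders, not real MeSH IDs.
TOY_DICTIONARY = {
    "breast cancer": "CONCEPT:0001",
    "breast carcinoma": "CONCEPT:0001",   # synonym
    "colorectal cancer": "CONCEPT:0002",
    "crc": "CONCEPT:0002",                # abbreviation
}


def canonical(mention):
    """Lowercase, turn runs of punctuation/whitespace into one space, and trim."""
    text = re.sub(r"[^a-z0-9]+", " ", mention.lower())
    return text.strip()


def normalize(mention, dictionary=TOY_DICTIONARY):
    """Return the concept ID for a mention, or None if the dictionary lacks it."""
    return dictionary.get(canonical(mention))


if __name__ == "__main__":
    for mention in ["Breast  Carcinoma", "CRC", "mammary tumour"]:
        print(mention, "->", normalize(mention))
```

The last mention is a plausible synonym that the toy dictionary does not list, so the lookup returns `None`. This is the coverage gap that motivates complementing dictionaries with learned representations.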

RESULTS

In this study, we propose an approach for normalizing biological entities, such as disease names and plant names, by using word embeddings to represent semantic spaces. For diseases, training data from the National Center for Biotechnology Information (NCBI) disease corpus and unlabeled data from PubMed abstracts were used to construct word representations. For plants, a training corpus that we manually constructed and unlabeled PubMed abstracts were used to represent word vectors. We showed that the proposed approach performed better than the use of only the training corpus or only the unlabeled data and showed that the normalization accuracy was improved by using our model even when the dictionaries were not comprehensive. We obtained F-scores of 0.808 and 0.690 for normalizing the NCBI disease corpus and manually constructed plant corpus, respectively. We further evaluated our approach using a data set in the disease normalization task of the BioCreative V challenge. When only the disease corpus was used as a dictionary, our approach significantly outperformed the best system of the task.
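
The general idea of using word embeddings for normalization can be sketched as a nearest-neighbor search in vector space. The example below is a simplified illustration under stated assumptions, not the model from the paper: the three-dimensional vectors are hand-picked toy values (real embeddings would be trained on PubMed abstracts and a labeled corpus), phrases are embedded by plain averaging, and the concept identifiers are hypothetical.

```python
import math

# Hypothetical 3-dimensional word vectors, hand-picked for illustration only.
TOY_VECTORS = {
    "breast": [0.9, 0.1, 0.0],
    "cancer": [0.1, 0.9, 0.1],
    "carcinoma": [0.2, 0.8, 0.1],
    "tumor": [0.2, 0.7, 0.3],
    "lung": [0.0, 0.1, 0.9],
}

# Hypothetical concept identifiers mapped to their preferred names.
CONCEPTS = {
    "CONCEPT:0001": "breast cancer",
    "CONCEPT:0003": "lung cancer",
}


def embed(phrase, vectors=TOY_VECTORS):
    """Average the vectors of known tokens; return None if no token is known."""
    known = [vectors[t] for t in phrase.lower().split() if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def normalize(mention, concepts=CONCEPTS):
    """Return (concept_id, similarity) for the nearest concept name, or None."""
    query = embed(mention)
    if query is None:
        return None
    scored = []
    for concept_id, name in concepts.items():
        vec = embed(name)
        if vec is not None:
            scored.append((cosine(query, vec), concept_id))
    if not scored:
        return None
    score, concept_id = max(scored)
    return concept_id, score


if __name__ == "__main__":
    print(normalize("breast carcinoma"))
    print(normalize("lung tumor"))
```

With these toy vectors, "breast carcinoma" lands closest to the "breast cancer" concept even though that exact string appears in no dictionary, which is the kind of robustness to incomplete dictionaries the abstract reports.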

CONCLUSIONS

The proposed approach shows robust performance for normalizing biological entities. The manually constructed plant corpus and the proposed model are available at http://gcancer.org/plant and http://gcancer.org/normalization , respectively.

Figures

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9d9e/5640957/5b59f38a9332/12859_2017_1857_Fig1_HTML.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9d9e/5640957/7e9c5bca9d5f/12859_2017_1857_Fig2_HTML.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9d9e/5640957/20b7d7fc0489/12859_2017_1857_Fig3_HTML.jpg
Fig. 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9d9e/5640957/d16831c72337/12859_2017_1857_Fig4_HTML.jpg
Fig. 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9d9e/5640957/ac511fe5ea65/12859_2017_1857_Fig5_HTML.jpg

Similar articles

1
A method for named entity normalization in biomedical articles: application to diseases and plants.
BMC Bioinformatics. 2017 Oct 13;18(1):451. doi: 10.1186/s12859-017-1857-8.
2
NLM-Chem-BC7: manually annotated full-text resources for chemical entity annotation and indexing in biomedical articles.
Database (Oxford). 2022 Dec 1;2022. doi: 10.1093/database/baac102.
3
Linking entities through an ontology using word embeddings and syntactic re-ranking.
BMC Bioinformatics. 2019 Mar 27;20(1):156. doi: 10.1186/s12859-019-2678-8.
4
NCBI disease corpus: a resource for disease name recognition and concept normalization.
J Biomed Inform. 2014 Feb;47:1-10. doi: 10.1016/j.jbi.2013.12.006. Epub 2014 Jan 3.
5
Biomedical named entity recognition using deep neural networks with contextual information.
BMC Bioinformatics. 2019 Dec 27;20(1):735. doi: 10.1186/s12859-019-3321-4.
6
Fine-Tuning Bidirectional Encoder Representations From Transformers (BERT)-Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study.
JMIR Med Inform. 2019 Sep 12;7(3):e14830. doi: 10.2196/14830.
7
TaggerOne: joint named entity recognition and normalization with semi-Markov Models.
Bioinformatics. 2016 Sep 15;32(18):2839-46. doi: 10.1093/bioinformatics/btw343. Epub 2016 Jun 9.
8
Assessment of disease named entity recognition on a corpus of annotated sentences.
BMC Bioinformatics. 2008 Apr 11;9 Suppl 3(Suppl 3):S3. doi: 10.1186/1471-2105-9-S3-S3.
9
Integrating various resources for gene name normalization.
PLoS One. 2012;7(9):e43558. doi: 10.1371/journal.pone.0043558. Epub 2012 Sep 12.
10
Chemical identification and indexing in full-text articles: an overview of the NLM-Chem track at BioCreative VII.
Database (Oxford). 2023 Mar 7;2023. doi: 10.1093/database/baad005.

Cited by

1
Plant attribute extraction: An enhancing three-stage deep learning model for relational triple extraction.
PLoS One. 2025 Jul 8;20(7):e0327186. doi: 10.1371/journal.pone.0327186. eCollection 2025.
2
HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools.
Bioinformatics. 2024 Oct 1;40(10). doi: 10.1093/bioinformatics/btae564.
3
Public data sources for regulatory genomic features.
Med Genet. 2021 Aug 14;33(2):167-177. doi: 10.1515/medgen-2021-2075. eCollection 2021 Jun.
4
NetMe 2.0: a web-based platform for extracting and modeling knowledge from biomedical literature as a labeled graph.
Bioinformatics. 2024 May 2;40(5). doi: 10.1093/bioinformatics/btae194.
5
Optimizing Signal Management in a Vaccine Adverse Event Reporting System: A Proof-of-Concept with COVID-19 Vaccines Using Signs, Symptoms, and Natural Language Processing.
Drug Saf. 2024 Feb;47(2):173-182. doi: 10.1007/s40264-023-01381-6. Epub 2023 Dec 7.
6
AIONER: all-in-one scheme-based biomedical named entity recognition using deep learning.
Bioinformatics. 2023 May 4;39(5). doi: 10.1093/bioinformatics/btad310.
7
Edge Weight Updating Neural Network for Named Entity Normalization.
Neural Process Lett. 2022 Dec 21:1-22. doi: 10.1007/s11063-022-11102-2.
8
Surgical procedure long terms recognition from Chinese literature incorporating structural feature.
Heliyon. 2022 Oct 29;8(11):e11291. doi: 10.1016/j.heliyon.2022.e11291. eCollection 2022 Nov.
9
Plant phenotype relationship corpus for biomedical relationships between plants and phenotypes.
Sci Data. 2022 May 26;9(1):235. doi: 10.1038/s41597-022-01350-1.
10
Natural Language Processing Algorithms for Normalizing Expressions of Synonymous Symptoms in Traditional Chinese Medicine.
Evid Based Complement Alternat Med. 2021 Oct 11;2021:6676607. doi: 10.1155/2021/6676607. eCollection 2021.

References

1
Challenges in clinical natural language processing for automated disorder normalization.
J Biomed Inform. 2015 Oct;57:28-37. doi: 10.1016/j.jbi.2015.07.010. Epub 2015 Jul 14.
2
tmChem: a high performance approach for chemical named entity recognition and normalization.
J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S3. doi: 10.1186/1758-2946-7-S1-S3. eCollection 2015.
3
NCBI disease corpus: a resource for disease name recognition and concept normalization.
J Biomed Inform. 2014 Feb;47:1-10. doi: 10.1016/j.jbi.2013.12.006. Epub 2014 Jan 3.
4
A modular framework for biomedical concept recognition.
BMC Bioinformatics. 2013 Sep 24;14:281. doi: 10.1186/1471-2105-14-281.
5
DNorm: disease name normalization with pairwise learning to rank.
Bioinformatics. 2013 Nov 15;29(22):2909-17. doi: 10.1093/bioinformatics/btt474. Epub 2013 Aug 21.
6
PubTator: a web-based text mining tool for assisting biocuration.
Nucleic Acids Res. 2013 Jul;41(Web Server issue):W518-22. doi: 10.1093/nar/gkt441. Epub 2013 May 22.
7
Gimli: open source and high-performance biomedical name recognition.
BMC Bioinformatics. 2013 Feb 15;14:54. doi: 10.1186/1471-2105-14-54.
8
ChemSpot: a hybrid system for chemical named entity recognition.
Bioinformatics. 2012 Jun 15;28(12):1633-40. doi: 10.1093/bioinformatics/bts183. Epub 2012 Apr 12.
9
MEDIC: a practical disease vocabulary used at the Comparative Toxicogenomics Database.
Database (Oxford). 2012 Mar 20;2012:bar065. doi: 10.1093/database/bar065. Print 2012.
10
Cross-species gene normalization by species inference.
BMC Bioinformatics. 2011 Oct 3;12 Suppl 8(Suppl 8):S5. doi: 10.1186/1471-2105-12-S8-S5.