DNorm: disease name normalization with pairwise learning to rank.

Affiliations

National Center for Biotechnology Information, 8600 Rockville Pike, Bethesda, MD 20894, USA and Department of Biomedical Informatics, Arizona State University, 13212 East Shea Blvd, Scottsdale, AZ 85259, USA.

Publication information

Bioinformatics. 2013 Nov 15;29(22):2909-17. doi: 10.1093/bioinformatics/btt474. Epub 2013 Aug 21.

DOI: 10.1093/bioinformatics/btt474
PMID: 23969135
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3810844/
Abstract

MOTIVATION

Despite the central role of diseases in biomedical research, there have been far fewer attempts to automatically determine which diseases are mentioned in a text (the task of disease name normalization, DNorm) than for other normalization tasks in biomedical text mining research.

METHODS

In this article we introduce the first machine learning approach for DNorm, using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH® and OMIM. Our method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. The technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in large optimization problems for information retrieval.
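
To make the idea of pairwise learning to rank concrete, here is a minimal sketch. It is not the authors' implementation. It assumes a bilinear score m^T W n over unit-length bag-of-words vectors, a margin ranking loss, and W initialized to the identity. None of these details are stated in the abstract, and all function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def build_vocab(texts):
    """Map each lower-cased whitespace token to a column index."""
    vocab = {}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Unit-length bag-of-words vector (raw counts; TF-IDF is a common alternative)."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(text.lower().split()).items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def score(m, W, n):
    """Bilinear similarity m^T W n between a mention and a concept name."""
    return sum(m[i] * W[i][j] * n[j]
               for i in range(len(m)) if m[i]
               for j in range(len(n)) if n[j])


def train(pairs, names, vocab, epochs=10, lr=0.5, margin=1.0):
    """Learn W from (mention, correct name) pairs with a margin ranking loss.

    Assumption: W starts as the identity, so the untrained score is cosine similarity.
    """
    d = len(vocab)
    W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    name_vecs = {name: vectorize(name, vocab) for name in names}
    for _ in range(epochs):
        for mention, correct in pairs:
            m, pos = vectorize(mention, vocab), name_vecs[correct]
            for name, neg in name_vecs.items():
                if name == correct:
                    continue
                # Update only when the correct name does not beat the
                # incorrect one by at least the margin.
                if score(m, W, pos) - score(m, W, neg) < margin:
                    for i in range(d):
                        for j in range(d):
                            W[i][j] += lr * m[i] * (pos[j] - neg[j])
    return W


def rank(mention, names, vocab, W):
    """Return candidate names sorted by descending learned similarity."""
    m = vectorize(mention, vocab)
    return sorted(names, key=lambda name: score(m, W, vectorize(name, vocab)),
                  reverse=True)


# Toy data, invented for illustration only.
names = ["breast neoplasms", "colorectal neoplasms", "heart diseases"]
pairs = [("breast cancer", "breast neoplasms"),
         ("colorectal cancer", "colorectal neoplasms"),
         ("cardiac disorder", "heart diseases")]
vocab = build_vocab(names + [mention for mention, _ in pairs])
W = train(pairs, names, vocab)
print(rank("cardiac disorder", names, vocab, W))
```

In this toy setting the mention "cardiac disorder" shares no token with "heart diseases", so cosine similarity alone cannot link them. After training, the learned W ranks "heart diseases" first, which is the kind of similarity a purely lexical matcher would miss.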

RESULTS

We compare our method with several techniques based on lexical normalization and matching, MetaMap and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase over the highest performing baseline method of 0.121 and 0.098, respectively.
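
The difference between the two reported averages can be illustrated with a short sketch. It assumes that each document has a set of gold concept identifiers and a set of predicted ones, and that macro-averaging is done per document. The abstract does not specify the averaging unit, so treat that as an assumption. The identifiers below are placeholders.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def counts(gold, pred):
    """True positives, false positives, false negatives for one document."""
    return len(gold & pred), len(pred - gold), len(gold - pred)


def micro_f(docs):
    """Pool the counts over all documents, then compute one F-measure."""
    tp = fp = fn = 0
    for gold, pred in docs:
        a, b, c = counts(gold, pred)
        tp, fp, fn = tp + a, fp + b, fn + c
    return f_measure(tp, fp, fn)


def macro_f(docs):
    """Compute one F-measure per document, then take the unweighted mean."""
    return sum(f_measure(*counts(gold, pred)) for gold, pred in docs) / len(docs)


# Placeholder concept identifiers: (gold set, predicted set) per document.
docs = [({"C1", "C2"}, {"C1"}),
        ({"C3"}, {"C3"})]
print(micro_f(docs), macro_f(docs))
```

Micro-averaging weights every mention-level decision equally, whereas macro-averaging weights every document equally, which is why the two figures in the abstract differ.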

AVAILABILITY

The source code for DNorm is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/DNorm, along with a web-based demonstration and links to the NCBI disease corpus. Results on PubMed abstracts are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator .
