Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets.

Affiliations

Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, USA.

University of Pittsburgh, 5607 Baum Blvd, Pittsburgh, PA, USA.

Publication information

J Biomed Inform. 2021 Sep;121:103880. doi: 10.1016/j.jbi.2021.103880. Epub 2021 Aug 12.

DOI:10.1016/j.jbi.2021.103880
PMID:34390853
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8952339/
Abstract

OBJECTIVES

Biomedical natural language processing tools are increasingly being applied for broad-coverage information extraction: extracting medical information of all types in a scientific document or a clinical note. In such broad-coverage settings, linking mentions of medical concepts to standardized vocabularies requires choosing the best candidate concepts from large inventories covering dozens of types. This study presents a novel semantic type prediction module for biomedical NLP pipelines and two automatically constructed, large-scale datasets with broad coverage of semantic types.

METHODS

We experiment with five off-the-shelf biomedical NLP toolkits on four benchmark datasets for medical information extraction from scientific literature and clinical notes. All toolkits adopt a staged approach of mention detection followed by two stages of medical entity linking: (1) generating a list of candidate concepts, and (2) picking the best concept among them. We introduce a semantic type prediction module to alleviate the problem of overgeneration of candidate concepts by filtering out irrelevant candidate concepts based on the predicted semantic type of a mention. We present MedType, a fully modular semantic type prediction model which we integrate into the existing NLP toolkits. To address the dearth of broad-coverage training data for medical information extraction, we further present WikiMed and PubMedDS, two large-scale datasets for medical entity linking.
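The filtering step described above can be illustrated with a minimal sketch. This is not MedType's implementation or API: the `Candidate` class, the `filter_by_type` and `link` functions, the concept IDs, the type labels, and the scores are all hypothetical, and the fallback to the unfiltered list when no candidate matches is an assumption made here for robustness rather than something the abstract states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    concept_id: str      # hypothetical ID, standing in for a vocabulary concept
    semantic_type: str   # coarse semantic type label (illustrative)
    score: float         # ranking score assigned by the underlying linker


def filter_by_type(candidates, predicted_types):
    """Drop candidates whose semantic type was not predicted for the mention.

    Assumption: if filtering would discard every candidate, return the
    original list so the linker can still produce an answer.
    """
    kept = [c for c in candidates if c.semantic_type in predicted_types]
    return kept if kept else list(candidates)


def link(candidates, predicted_types):
    """Stage 2 of linking: pick the highest-scoring candidate after filtering."""
    pool = filter_by_type(candidates, predicted_types)
    return max(pool, key=lambda c: c.score) if pool else None


# Toy candidate list for one ambiguous mention (made-up values).
candidates = [
    Candidate("C_HYPOTHETICAL_1", "Organization", 0.62),
    Candidate("C_HYPOTHETICAL_2", "Disease", 0.58),
    Candidate("C_HYPOTHETICAL_3", "Chemical", 0.31),
]

all_types = {c.semantic_type for c in candidates}
best_unfiltered = link(candidates, all_types)   # no effective filtering
best_filtered = link(candidates, {"Disease"})   # type predictor says "Disease"
```

In this toy case the unfiltered linker picks the top-scoring but wrong-typed candidate, while the type filter removes it before ranking, which is the overgeneration problem the module is meant to alleviate.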

RESULTS

Semantic type filtering improves medical entity linking performance across all toolkits and datasets, often by several percentage points of F-1. Further, pretraining MedType on our novel datasets achieves state-of-the-art performance for semantic type prediction in biomedical text.
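As a reminder of how the reported metric behaves, the sketch below computes precision, recall, and F1 over sets of (mention, concept) pairs. The function name, the pair encoding, and all values are illustrative assumptions; the numbers are toy data, not results from the paper, and the exact matching criteria used in the study may differ.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 over sets of (mention_id, concept_id)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy gold annotations and two hypothetical system outputs.
gold = {(0, "C1"), (1, "C2"), (2, "C3"), (3, "C4")}
without_filter = {(0, "C1"), (1, "CX"), (2, "C3"), (3, "CY")}
with_filter = {(0, "C1"), (1, "C2"), (2, "C3"), (3, "CY")}

scores_without = precision_recall_f1(gold, without_filter)
scores_with = precision_recall_f1(gold, with_filter)
```

Correcting one wrongly linked mention raises both precision and recall here, so F1 rises with them.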

CONCLUSIONS

Semantic type prediction is a key part of building accurate NLP pipelines for broad-coverage information extraction from biomedical text. We make our source code and novel datasets publicly available to foster reproducible research.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/658b/8952339/eb39312e83ea/nihms-1786948-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/658b/8952339/d2c0222e8343/nihms-1786948-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/658b/8952339/01f6186653d3/nihms-1786948-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/658b/8952339/0849b33b332d/nihms-1786948-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/658b/8952339/1b70997a0f29/nihms-1786948-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/658b/8952339/cf36459fe0ed/nihms-1786948-f0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/658b/8952339/4ec09fc48d1f/nihms-1786948-f0007.jpg
