

Benchmarking Transformer Embedding Models for Biomedical Terminology Standardization.

Author information

Lahiri Aditya, Shukla Sangeeta, Stear Ben, Ahooyi Taha Mohseni, Beigel Katherine, Margolskee Elizabeth, Taylor Deanne

Affiliations

The Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, Philadelphia PA, USA.

Department of Pathology & Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia PA, USA.

Publication information

Mach Learn Appl. 2025 Sep;21. doi: 10.1016/j.mlwa.2025.100683. Epub 2025 Jun 5.

Abstract

Biomedical text in public databases often exhibits unstandardized terminology and inconsistencies that impede machine learning applications and hinder data integration across biomedical databases. Leveraging generalized and specialized transformer/large language models (LLMs) offers a potentially scalable solution for terminology standardization. We evaluated this opportunity using the National Institutes of Health Clinical Trials Registry (CTR), which contains heterogeneous, free-text records of disease from therapeutic trials. To systematically assess the ability of machine learning methods to assign biomedical terms accurately, we benchmarked 36 approaches using transformer/LLM-based text embeddings, along with traditional text-matching algorithms, against a clinical gold standard: the World Health Organization Classification of Tumours system (WHO System, also known as the WHO Blue Books). For this evaluation, we developed CANTOS (Clinical Trials Automated Nomenclature and Tumor Ontology Standardization), a computational benchmarking framework that extracts tumor names from the CTR and standardizes them using the WHO System and the National Cancer Institute Thesaurus (NCIt). We assessed standardization accuracy using a random sample of 1,600 CTR tumor names manually annotated with WHO System terms. LLM/transformer-based embedding methods significantly outperformed text-matching approaches: all-MiniLM-L12-v2+Euclidean distance achieved 67.7% accuracy (WHO-5th edition), while LTE-3+Euclidean distance achieved 69.4% (WHO-all editions). Text-matching methods peaked at 32.6% accuracy. A majority voting approach combining three high-accuracy, low-agreement methods improved accuracy to 71.9% (WHO-5th) and 71.6% (WHO-all). Our findings demonstrate the effectiveness of embedding models in standardizing biomedical terminology and provide a reproducible framework for benchmarking machine learning methods against clinical gold standards using real-world datasets.
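The two mechanisms named in the abstract, nearest-term assignment by Euclidean distance and majority voting across three methods, can be illustrated with a minimal Python sketch. This is not the CANTOS implementation: the two-dimensional vectors are hand-made toy values rather than model embeddings, the function names `nearest_term` and `majority_vote` are hypothetical, and the tie-breaking rule (fall back to the first method when no label has a majority) is an assumption, since the abstract does not state how disagreements among all three methods are resolved.

```python
import math
from collections import Counter


def nearest_term(query_vec, term_vecs):
    """Return the reference term whose vector is closest to query_vec
    in Euclidean distance. term_vecs maps term name -> vector."""
    return min(term_vecs, key=lambda term: math.dist(query_vec, term_vecs[term]))


def majority_vote(predictions):
    """Return the label predicted by more than half of the methods.
    Assumption: if no label has a majority, use the first method's label."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else predictions[0]


# Toy stand-ins for three embedding methods. Each method has its own
# vector space for the reference terms and its own vector for the query.
space_a = {"Glioblastoma": (1.0, 0.0), "Osteosarcoma": (0.0, 1.0)}
space_b = {"Glioblastoma": (0.0, 2.0), "Osteosarcoma": (2.0, 0.0)}
space_c = {"Glioblastoma": (3.0, 3.0), "Osteosarcoma": (-3.0, -3.0)}

# Hypothetical vectors for one free-text tumor name under each method.
query_a = (0.9, 0.2)
query_b = (1.5, 0.4)
query_c = (2.0, 2.5)

predictions = [
    nearest_term(query_a, space_a),
    nearest_term(query_b, space_b),
    nearest_term(query_c, space_c),
]
consensus = majority_vote(predictions)
print(predictions, "->", consensus)
```

In this toy case the second method disagrees with the other two, and the vote recovers the label that two of the three methods share. This is the situation in which combining accurate but low-agreement methods can help.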



