Benchmarking Transformer Embedding Models for Biomedical Terminology Standardization

Author information

Lahiri Aditya, Shukla Sangeeta, Stear Ben, Ahooyi Taha Mohseni, Beigel Katherine, Margolskee Elizabeth, Taylor Deanne

Affiliations

The Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, Philadelphia PA, USA.

Department of Pathology & Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia PA, USA.

Publication information

Mach Learn Appl. 2025 Sep;21. doi: 10.1016/j.mlwa.2025.100683. Epub 2025 Jun 5.

DOI:10.1016/j.mlwa.2025.100683

PMID:40718094

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12288841/

Abstract

Biomedical text in public databases often contains unstandardized, inconsistent terminology, which impedes machine learning applications and hinders data integration across biomedical databases. General-purpose and specialized transformer/large language models (LLMs) offer a potentially scalable solution for terminology standardization. We evaluated this approach using the National Institutes of Health Clinical Trials Registry (CTR), which contains heterogeneous, free-text disease records from therapeutic trials. To systematically assess how accurately machine learning methods assign biomedical terms, we benchmarked 36 approaches using transformer/LLM-based text embeddings, along with traditional text-matching algorithms, against a clinical gold standard: the World Health Organization Classification of Tumours system (WHO System, also known as the WHO Blue Books). For this evaluation, we developed CANTOS (Clinical Trials Automated Nomenclature and Tumor Ontology Standardization), a computational benchmarking framework that extracts tumor names from the CTR and standardizes them using the WHO System and the National Cancer Institute Thesaurus (NCIt). We assessed standardization accuracy using a random sample of 1,600 CTR tumor names manually annotated with WHO System terms. LLM/transformer-based embedding methods significantly outperformed text-matching approaches: all-MiniLM-L12-v2+Euclidean distance achieved 67.7% accuracy (WHO-5th edition), while LTE-3+Euclidean distance achieved 69.4% (WHO-all editions). Text-matching methods peaked at 32.6% accuracy. A majority voting approach combining three high-accuracy, low-agreement methods improved accuracy to 71.9% (WHO-5th) and 71.6% (WHO-all). Our findings demonstrate the effectiveness of embedding models in standardizing biomedical terminology and provide a reproducible framework for benchmarking machine learning methods against clinical gold standards using real-world datasets.
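
As a rough illustration of the two ideas named in the abstract (assigning each free-text tumor name to its nearest reference term by Euclidean distance between embeddings, and combining several methods by majority vote), the following minimal sketch may help. It is not the CANTOS implementation. The function names are hypothetical, `toy_embed` is a character-trigram stand-in for a real transformer embedding model, the reference terms are made-up examples, and the tie-breaking rule in `majority_vote` is an assumption, since the abstract does not state how ties are resolved.

```python
import math
from collections import Counter
from typing import Callable, Dict, Sequence

Vector = Dict[str, float]  # sparse vector: feature name -> weight


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for a transformer embedding:
    L2-normalized character-trigram counts of the lowercased text."""
    padded = f"  {text.lower().strip()} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()}


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def nearest_term(
    name: str,
    reference_terms: Sequence[str],
    embed: Callable[[str], Vector] = toy_embed,
) -> str:
    """Return the reference term whose embedding is closest to that of `name`."""
    query = embed(name)
    return min(reference_terms, key=lambda term: euclidean(query, embed(term)))


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most frequent prediction across methods.
    Assumption: if no term is predicted more than once, fall back to the
    first method's answer."""
    term, votes = Counter(predictions).most_common(1)[0]
    return term if votes > 1 else predictions[0]


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of names whose predicted term equals the annotated term."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Made-up reference terms, for illustration only.
REFERENCE = ["Adenocarcinoma of the lung", "Glioblastoma", "Osteosarcoma"]
print(nearest_term("lung adenocarcinoma, stage IV", REFERENCE))
```

In practice, `embed` would wrap a real embedding model, and the reference term vectors would be computed once and cached rather than re-embedded for every query.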
