Longitudinal Masked Representation Learning for Pulmonary Nodule Diagnosis from Language Embedded EHRs.

Author information

Li Thomas Z, Still John M, Zuo Lianrui, Liu Yihao, Krishnan Aravind R, Sandler Kim L, Maldonado Fabien, Lasko Thomas A, Landman Bennett A

Affiliations

Department of Biomedical Engineering, Vanderbilt University, Nashville, TN.

Medical Scientist Training Program, Vanderbilt University, Nashville, TN.

Publication information

medRxiv. 2025 May 11:2025.05.09.25327341. doi: 10.1101/2025.05.09.25327341.

DOI:10.1101/2025.05.09.25327341

PMID:40385386

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12083608/

Abstract

Electronic health records (EHRs) are a rich source of clinical data, yet exploiting longitudinal signals for pulmonary nodule diagnosis remains challenging due to the administrative noise and high level of clinical abstraction present in these records. Because of this complexity, classification models are prone to overfitting when labeled data is scarce. This study explores masked representation learning (MRL) as a strategy to improve pulmonary nodule diagnosis by modeling longitudinal EHRs across multiple modalities: clinical conditions, procedures, and medications. We leverage a web-scale text embedding model to encode EHR event streams into semantically embedded sequences. We then pretrain a bidirectional transformer using MRL conditioned on time encodings on a large cohort of general pulmonary conditions from our home institution. Evaluation on a cohort of diagnosed pulmonary nodules demonstrates significant improvement in diagnosis accuracy with a model finetuned from MRL (0.781 AUC, 95% CI: [0.780, 0.782]) compared to a supervised model with the same architecture (0.768 AUC, 95% CI: [0.766, 0.770]) when integrating all three modalities. These findings suggest that language-embedded MRL can facilitate downstream clinical classification, offering potential advancements in the comprehensive analysis of longitudinal EHR modalities.
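
The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the masking ratio, the form of the time encoding, or the reconstruction loss, so the 15% ratio, the sinusoidal encoding, and the mean squared error below are all assumptions, and the names `time_encoding`, `mask_events`, and `reconstruction_loss` are hypothetical. Event embeddings are plain lists standing in for the output of a text embedding model.

```python
import math
import random


def time_encoding(t_days, dim):
    """Sinusoidal encoding of an event time in days (assumed form)."""
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.append(math.sin(t_days * freq))
        enc.append(math.cos(t_days * freq))
    return enc[:dim]  # trim in case dim is odd


def mask_events(embeddings, times, mask_prob=0.15, seed=0):
    """Build one masked-reconstruction example from an event sequence.

    embeddings: list of event vectors (stand-ins for text embeddings)
    times: list of event times in days, same length as embeddings
    Masked events have their content zeroed but keep their time encoding,
    so a model could reconstruct content conditioned on event time.
    """
    rng = random.Random(seed)
    n, dim = len(embeddings), len(embeddings[0])
    masked_idx = [i for i in range(n) if rng.random() < mask_prob]
    if not masked_idx:  # guarantee at least one masked position
        masked_idx = [rng.randrange(n)]
    masked = set(masked_idx)
    inputs = []
    for i, (vec, t) in enumerate(zip(embeddings, times)):
        content = [0.0] * dim if i in masked else list(vec)
        inputs.append([c + e for c, e in zip(content, time_encoding(t, dim))])
    targets = {i: list(embeddings[i]) for i in masked_idx}
    return inputs, targets, masked_idx


def reconstruction_loss(predictions, targets):
    """Mean squared error over masked positions only (assumed loss)."""
    total, count = 0.0, 0
    for i, target in targets.items():
        for p, y in zip(predictions[i], target):
            total += (p - y) ** 2
            count += 1
    return total / count


# Toy sequence: 10 events with 4-dimensional embeddings, one event per day.
events = [[1.0, 2.0, 3.0, 4.0] for _ in range(10)]
days = list(range(10))
inputs, targets, masked_idx = mask_events(events, days, mask_prob=0.3, seed=1)
```

In a full pipeline, `inputs` would be passed to a bidirectional transformer, and its outputs at `masked_idx` would be scored against `targets` with `reconstruction_loss`; the pretrained encoder would then be finetuned for classification.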
