Department of Electrical and Computer Engineering, Indiana University-Purdue University Indianapolis (IUPUI), Indianapolis, IN, 46202, USA.
Indiana University School of Medicine, 340 W. 10th St, Indianapolis, IN, 46202, USA; Regenstrief Institute, Inc., 1101 W. 10th Street, Indianapolis, IN, 46202, USA.
Comput Biol Med. 2024 Nov;182:109144. doi: 10.1016/j.compbiomed.2024.109144. Epub 2024 Sep 18.
Several general-purpose language model (LM) architectures have been proposed with demonstrated improvements in text summarization and classification. Adapting these architectures to the medical domain requires additional considerations. For instance, the medical history of the patient is documented in the Electronic Health Record (EHR), which includes many medical notes drafted by healthcare providers. Direct processing of these notes may not be possible because the computational complexity of LMs imposes a limit on the length of the input text. Therefore, previous applications resorted to content selection using truncation or summarization of the text. Unfortunately, these text processing techniques may lead to information loss, redundancy, or irrelevance. In the present paper, a decision-focused content selection technique is proposed. The objective of this technique is to select a subset of sentences from a patient's medical notes that are relevant to the target outcome over a predefined observation period. This decision-focused content selection methodology is then used to develop a dementia risk prediction model based on the Longformer LM architecture. The results show that the proposed framework delivers an AUC of 78.43 when the summary is restricted to 1024 tokens, outperforming previously proposed content selection techniques. This performance is notable given that the model estimates dementia risk with a one-year prediction horizon, relies on an observation period of only one year, and uses medical notes alone, without other EHR data modalities. Moreover, the proposed techniques overcome the limitations of machine learning models that use a tabular representation of the text by preserving contextual content, enable feature engineering from raw text, and circumvent the computational complexity of language models.
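The abstract does not detail how sentence relevance is computed, but the core idea of decision-focused content selection within a token budget can be sketched as follows. This is a hypothetical illustration, assuming relevance scores (e.g., from a trained outcome classifier) are already available for each sentence; the greedy budget-constrained selection and whitespace tokenization are simplifying assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: rank sentences by an outcome-relevance score,
# then keep the top-scoring ones until a token budget (e.g., the
# 1024-token limit mentioned in the abstract) is exhausted.

def select_sentences(sentences, relevance_scores, token_budget=1024):
    """Greedily keep the most relevant sentences within a token budget."""
    ranked = sorted(zip(relevance_scores, sentences), reverse=True)
    kept, used = [], 0
    for score, sent in ranked:
        n_tokens = len(sent.split())  # crude whitespace tokenization
        if used + n_tokens > token_budget:
            continue
        kept.append(sent)
        used += n_tokens
    # Restore the original note order to preserve narrative context.
    return sorted(kept, key=sentences.index)

notes = [
    "Patient reports progressive memory loss over six months.",
    "Blood pressure 120/80, within normal limits.",
    "Daughter notes increasing confusion with daily tasks.",
]
scores = [0.9, 0.1, 0.8]  # assumed output of a relevance classifier
summary = select_sentences(notes, scores, token_budget=16)
```

With the toy 16-token budget, the two high-scoring dementia-relevant sentences are kept and the unrelated vitals sentence is dropped, illustrating how selection is driven by the downstream decision rather than by generic summarization.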