Paolo Domenico, Greco Carlo, Cortellini Alessio, Ramella Sara, Soda Paolo, Bria Alessandro, Sicilia Rosa
Unit of Computer Systems & Bioinformatics, Department of Engineering, University Campus Bio-Medico di Roma, Roma, Italy.
Research Unit of Radiation Oncology, Department of Medicine and Surgery, University Campus Bio-Medico di Roma, Roma, Italy.
BMC Med Inform Decis Mak. 2025 Apr 18;25(1):169. doi: 10.1186/s12911-025-02998-6.
The automated processing of Electronic Health Records (EHRs) poses a significant challenge because the records are unstructured: they are rich in valuable yet disorganized information. Natural Language Processing (NLP), particularly Named Entity Recognition (NER), has been instrumental in extracting structured information from EHR data. However, the existing literature primarily focuses on extracting handcrafted clinical features through NLP and NER methods without delving into the representations these models learn. In this work, we explore the untapped potential of these representations by considering their contextual richness and entity-specific information. Our proposed methodology extracts the representations generated by a transformer-based NER model on EHR data, combines them using a hierarchical attention mechanism, and uses the resulting enriched representation as input to a clinical prediction model. Specifically, this study addresses the prediction of Overall Survival (OS) in Non-Small Cell Lung Cancer (NSCLC) using unstructured EHR data collected from an Italian clinical centre, encompassing 838 records from 231 lung cancer patients. Whilst our study is applied to EHRs written in Italian, it serves as a use case to demonstrate the effectiveness of extracting and employing high-level textual representations that capture relevant information as named entities. Our methodology is interpretable because the hierarchical attention mechanism highlights the information in the EHRs that the model considers most crucial during the decision-making process. We validated this interpretability through a questionnaire that measured the agreement of domain experts with the importance the hierarchical attention mechanism assigns to EHR information. Results demonstrate the effectiveness of our method, showing statistically significant improvements over traditional manually extracted clinical features.
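As an illustration of the pooling idea described in the abstract, the following is a minimal sketch of hierarchical attention, not the authors' implementation. It assumes two levels (tokens within a record, then records within a patient), uses random vectors in place of the NER model's representations, and uses random query vectors in place of learned attention parameters. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def attention_pool(vectors, query):
    # Score each row against the query, normalise the scores,
    # and return the weighted sum together with the weights.
    scores = vectors @ query / np.sqrt(vectors.shape[1])
    weights = softmax(scores)
    return weights @ vectors, weights

def hierarchical_pool(patient_records, token_query, record_query):
    # Level 1: pool the token vectors of each record into one record vector.
    record_vecs, token_weights = [], []
    for tokens in patient_records:
        vec, w = attention_pool(tokens, token_query)
        record_vecs.append(vec)
        token_weights.append(w)
    # Level 2: pool the record vectors into one patient vector.
    patient_vec, record_weights = attention_pool(np.stack(record_vecs), record_query)
    return patient_vec, record_weights, token_weights

# Toy data (assumption): one patient with three records of 5, 3 and 7 tokens,
# each token represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
dim = 8
records = [rng.normal(size=(n, dim)) for n in (5, 3, 7)]
token_query = rng.normal(size=dim)    # stand-in for a learned parameter
record_query = rng.normal(size=dim)   # stand-in for a learned parameter

patient_vec, record_weights, token_weights = hierarchical_pool(
    records, token_query, record_query
)
```

In a sketch like this, `patient_vec` would be passed to the downstream survival model, while `record_weights` and `token_weights` are the quantities one would inspect to see which records and tokens contributed most, which is the kind of importance signal the abstract reports comparing against expert judgement.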