Department of Biomedical Informatics, Columbia University, New York, New York, United States.
School of Nursing, University of Pennsylvania, Philadelphia, Pennsylvania, United States.
Appl Clin Inform. 2024 Mar;15(2):357-367. doi: 10.1055/a-2282-4340. Epub 2024 Mar 6.
Narrative nursing notes are a valuable resource in informatics research, carrying unique predictive signals about patient care. Open sharing of these data, however, is appropriately constrained by the privacy regulations of the Health Insurance Portability and Accountability Act (HIPAA). Several de-identification models have been developed and evaluated on the open-source i2b2 dataset, but their generalizability to nursing notes remains understudied.
The study aims to understand the generalizability of pretrained transformer models and to investigate how the distribution of protected health information (PHI) varies between discharge summaries and nursing notes, with the goal of informing the design of future model evaluation schemas.
Two pretrained transformer models (RoBERTa and ClinicalBERT), fine-tuned on i2b2 2014 discharge summaries, were evaluated on our inpatient nursing notes and compared with their baseline performance. Statistical tests were used to assess differences in PHI distribution between discharge summaries and nursing notes.
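The abstract does not name the statistical test that was used. As one possible illustration only, a comparison of PHI prevalence between two note types could be sketched with a Pearson chi-square test on a 2x2 table. The function name `chi_square_2x2` and all counts below are hypothetical and are not taken from the study.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Layout (hypothetical):
                          with PHI   without PHI
      discharge summaries     a           b
      nursing notes           c           d

    Returns the chi-square statistic and the p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("Every row and column must have a nonzero total.")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # With 1 degree of freedom, chi2 is the square of a standard normal,
    # so the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Illustrative counts only (NOT study data): 30 of 40 discharge summaries
# and 10 of 40 nursing notes contain at least one PHI instance.
chi2, p_value = chi_square_2x2(30, 10, 10, 30)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")
```

In practice a library routine such as SciPy's `chi2_contingency` would usually be preferred, and a different test may be appropriate for per-note PHI counts rather than presence or absence.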
RoBERTa achieved the best performance when tested on the external data source, with an F1 score of 0.887 across PHI categories and 0.932 on the binary PHI task. Overall, discharge summaries contained more PHI instances and more categories of PHI than inpatient nursing notes.
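The difference between the two reported F1 scores can be illustrated with a minimal sketch. It assumes token-level labels and micro-averaging, neither of which is stated in the abstract. The label names, the helper functions, and the toy sequences are hypothetical. The category score requires the predicted PHI type to match the gold type, while the binary score only asks whether a token is PHI at all.

```python
from typing import Sequence

OUTSIDE = "O"  # hypothetical label for non-PHI tokens


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall; 0.0 if no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def category_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Micro-averaged token-level F1 where the PHI category must match exactly."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == p and g != OUTSIDE)
    fp = sum(1 for g, p in pairs if p != OUTSIDE and g != p)
    fn = sum(1 for g, p in pairs if g != OUTSIDE and g != p)
    return f1_from_counts(tp, fp, fn)


def binary_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Token-level F1 after collapsing every PHI category into one PHI class."""
    collapse = lambda labels: [OUTSIDE if x == OUTSIDE else "PHI" for x in labels]
    return category_f1(collapse(gold), collapse(pred))


# Toy sequences (NOT study data). The third token is PHI in both sequences
# but with the wrong category, so it hurts only the category score.
gold = ["O", "NAME", "NAME", "O", "DATE", "O", "LOCATION", "O"]
pred = ["O", "NAME", "DATE", "O", "DATE", "O", "O", "AGE"]

print(f"category F1 = {category_f1(gold, pred):.3f}")
print(f"binary F1   = {binary_f1(gold, pred):.3f}")
```

Because a wrong-category prediction still counts as a correct PHI detection in the binary task, the binary score is never lower than the category score under this scheme, which is consistent with the ordering of the two reported values.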
The study investigated the applicability of two pretrained transformers to inpatient nursing notes and examined how nursing notes and discharge summaries differ in the PHI they contain. Discharge summaries contained more PHI instances and types than narrative nursing notes, but narrative nursing notes exhibited more diversity in the types of PHI present, some of which pertained to patients' personal lives. These insights can help improve the design and selection of algorithms and contribute to the development of suitable performance thresholds for PHI detection.