School of Computer Science, University of Manchester, Manchester, UK; The Christie NHS Foundation Trust, Manchester, UK.
Faculty of Technical Sciences, University of Novi Sad, Novi Sad, Serbia.
J Biomed Inform. 2017 Nov;75S:S28-S33. doi: 10.1016/j.jbi.2017.06.005. Epub 2017 Jun 7.
De-identification of clinical narratives is one of the main obstacles to making healthcare free text available for research. In this paper we describe our experience in expanding and tailoring two existing tools as part of the 2016 CEGS N-GRID Shared Tasks Track 1, which evaluated de-identification methods on a set of psychiatric evaluation notes for up to 25 different types of Protected Health Information (PHI). The methods we used rely on machine learning over either a large or a small feature space, with additional strategies, including two-pass tagging and multi-class models, both of which proved beneficial. The results show that integrating the proposed methods can identify PHI defined by the Health Insurance Portability and Accountability Act (HIPAA) with overall F-scores of approximately 90% or higher. However, some classes (Profession, Organization) again proved challenging, given the variability of the expressions used to refer to this information.
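For readers unfamiliar with the metric, the sketch below shows how an F-score over PHI annotations can be computed. It is a minimal illustration, not the shared task's official scorer: it assumes strict matching, where a predicted span counts as correct only if its character offsets and PHI type both match a gold annotation exactly. The `Span` type, the `f_score` function, and the example offsets are hypothetical and do not come from the paper.

```python
from typing import Set, Tuple

# Hypothetical representation of one PHI annotation:
# (start offset, end offset, PHI type).
Span = Tuple[int, int, str]


def f_score(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under strict exact-match scoring."""
    # A prediction is a true positive only if offsets and type both match.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations for a single note (illustrative only).
gold = {(0, 10, "NAME"), (25, 35, "DATE"), (50, 58, "PROFESSION")}
predicted = {(0, 10, "NAME"), (25, 35, "DATE"), (60, 70, "ORGANIZATION")}

precision, recall, f1 = f_score(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case two of three predictions are correct and two of three gold annotations are found, so a missed Profession mention and a spurious Organization mention each lower the score. Relaxed or token-level matching, which some evaluations also report, would generally give different values.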