Dorr D A, Phillips W F, Phansalkar S, Sims S A, Hurdle J F
Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland, OR 97239, USA.
Methods Inf Med. 2006;45(3):246-52.
To characterize the difficulty investigators face in removing protected health information (PHI) from cross-discipline, free-text clinical notes, an important challenge for clinical informatics research that has been reshaped by the introduction of the US Health Insurance Portability and Accountability Act (HIPAA) and similar regulations.
Clinical narratives were randomly selected from complete admissions written by diverse providers and reviewed using a two-tiered rater system and simple automated regular-expression tools. For manual review, two independent reviewers used simple search-and-replace algorithms and visual scanning to find PHI as defined by HIPAA, followed by an independent second review to detect any missed PHI. Simple automated review was also performed for the "easy" PHI that is number- or date-based.
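As an illustration of what a simple regular-expression pass over number- or date-based PHI might look like, the following is a minimal Python sketch. The abstract does not give the expressions the authors used, so the patterns, labels, and the `scrub` function are hypothetical and cover only a few US-style formats.

```python
import re

# Hypothetical patterns for "easy" PHI; not the expressions used in the study.
PHI_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),   # e.g. 3/14/2005
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),  # e.g. 503-555-0100
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # e.g. 123-45-6789
}


def scrub(text):
    """Replace each pattern match with a bracketed label.

    Returns the scrubbed text and a list of (label, matched_text) pairs.
    """
    found = []
    for label, pattern in PHI_PATTERNS.items():
        found.extend((label, m.group()) for m in pattern.finditer(text))
        text = pattern.sub(f"[{label}]", text)
    return text, found


note = "Seen on 3/14/2005; call 503-555-0100. SSN 123-45-6789."
scrubbed, found = scrub(note)
print(scrubbed)
```

Patterns this loose will also flag any non-PHI string that happens to share one of these shapes, and will miss dates or numbers written in other formats. Both kinds of error have to be measured against a manually reviewed reference.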
From 262 notes, 2074 PHI elements, or 7.9 ± 6.1 per note, were found. The average recall (or sensitivity) was 95.9% while precision was 99.6% for single reviewers. Agreement between individual reviewers was strong (ICC = 0.99), although some asymmetry in errors was seen between reviewers (p = 0.001). The automated technique had better recall (98.5%) but worse precision (88.4%) for its subset of identifiers. Manually de-identifying a note took 87.3 ± 61 seconds on average.
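For readers less familiar with the two metrics, recall is the share of true PHI elements that were found, and precision is the share of flagged elements that really were PHI. The Python sketch below shows the standard calculation. The counts are hypothetical, chosen only to land near the automated tool's reported figures; the abstract does not report the underlying counts.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from raw counts.

    precision = TP / (TP + FP): fraction of flagged items that are real PHI
    recall    = TP / (TP + FN): fraction of real PHI that was flagged
    """
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Hypothetical counts, not taken from the study.
precision, recall = precision_recall(
    true_positives=197, false_positives=26, false_negatives=3
)
print(f"precision={precision:.1%}, recall={recall:.1%}")
```

With counts like these, a tool misses very little PHI (high recall) but roughly one in nine of the items it flags is not PHI (lower precision), which is the trade-off the results describe for the automated technique.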
Manual de-identification of free-text notes is tedious and time-consuming, but even simple PHI is difficult to automatically identify with the exactitude required under HIPAA.