Norgeot Beau, Muenzen Kathleen, Peterson Thomas A, Fan Xuancheng, Glicksberg Benjamin S, Schenk Gundolf, Rutenberg Eugenia, Oskotsky Boris, Sirota Marina, Yazdany Jinoos, Schmajuk Gabriela, Ludwig Dana, Goldstein Theodore, Butte Atul J
1 Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA.
2 Division of Rheumatology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA.
NPJ Digit Med. 2020 Apr 14;3:57. doi: 10.1038/s41746-020-0258-y. eCollection 2020.
There is a great and growing need to ascertain the exact state of a patient, in terms of disease progression, actual care practices, pathology, adverse events, and much more, beyond the sparse information available in structured medical record data. Ascertaining these harder-to-reach data elements is now critical for accurate phenotyping of complex traits, detection of adverse outcomes, assessment of the efficacy of off-label drug use, and longitudinal patient surveillance. Clinical notes often contain the most detailed and relevant digital information about individual patients, the nuances of their diseases, the treatment strategies selected by physicians, and the resulting outcomes. However, notes remain largely unused for research because they contain Protected Health Information (PHI), which is synonymous with individually identifying data. Previous clinical note de-identification approaches have been rigid and still too inaccurate to see any substantial real-world use, primarily because they were trained on medical text corpora that were too small. To build a new de-identification tool, we created the largest manually annotated clinical note corpus for PHI and developed customizable open-source de-identification software called Philter ("Protected Health Information filter"). Here we describe the design and evaluation of Philter, and show how it offers substantial real-world improvements over prior methods.