Stubbs Amber, Uzuner Özlem
School of Library and Information Science, Simmons College, Boston, MA, USA.
Department of Information Studies, State University of New York at Albany, Albany, NY, USA.
J Biomed Inform. 2015 Dec;58 Suppl(Suppl):S20-S29. doi: 10.1016/j.jbi.2015.07.020. Epub 2015 Aug 28.
The 2014 i2b2/UTHealth natural language processing shared task featured a track focused on the de-identification of longitudinal medical records. For this track, we de-identified a set of 1304 longitudinal medical records describing 296 patients. This corpus was de-identified under a broad interpretation of the HIPAA guidelines using double annotation followed by arbitration, rounds of sanity checking, and proofreading. The average token-based F1 measure for the annotators compared to the gold standard was 0.927. The resulting annotations were used both to de-identify the data and to set the gold standard for the de-identification track of the 2014 i2b2/UTHealth shared task. All annotated private health information was replaced automatically with realistic surrogates, which were then read over and corrected manually. The resulting corpus is the first of its kind made available for de-identification research. This corpus was first used for the 2014 i2b2/UTHealth shared task, during which the systems achieved a mean F-measure of 0.872 and a maximum F-measure of 0.964 using entity-based micro-averaged evaluations.
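The entity-based micro-averaged F-measure mentioned in the abstract can be illustrated with a minimal sketch. It assumes strict matching, where a predicted entity counts as correct only if its start offset, end offset, and type all equal those of a gold entity, and it pools the counts over all records before computing precision and recall. The function name `micro_f1`, the `(start, end, type)` tuple layout, and the per-record dictionaries are hypothetical choices for illustration; the shared task's own evaluation script may differ in its matching rules.

```python
from typing import Dict, Set, Tuple

# Hypothetical entity representation: (start_offset, end_offset, phi_type).
Entity = Tuple[int, int, str]


def micro_f1(
    gold: Dict[str, Set[Entity]],
    pred: Dict[str, Set[Entity]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) with counts pooled over all records."""
    tp = fp = fn = 0
    # Visit every record that appears in either the gold or predicted set.
    for record_id in gold.keys() | pred.keys():
        g = gold.get(record_id, set())
        p = pred.get(record_id, set())
        tp += len(g & p)  # exact matches on span and type
        fp += len(p - g)  # predicted entities with no gold match
        fn += len(g - p)  # gold entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Because the counts are summed before dividing, records with many entities influence the score more than records with few, which is what distinguishes micro-averaging from averaging per-record scores.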