Stubbs Amber, Kotfila Christopher, Uzuner Özlem
School of Library and Information Science, Simmons College, Boston, MA, USA.
Department of Information Studies, State University of New York at Albany, Albany, NY, USA.
J Biomed Inform. 2015 Dec;58 Suppl(Suppl):S11-S19. doi: 10.1016/j.jbi.2015.06.007. Epub 2015 Jul 28.
The 2014 i2b2/UTHealth Natural Language Processing (NLP) shared task featured four tracks. The first of these was the de-identification track, which focused on identifying protected health information (PHI) in longitudinal clinical narratives. The longitudinal nature of clinical narratives calls particular attention to details that are benign on their own in separate records but that, in combination across longitudinal records, can lead to the identification of patients. Accordingly, the 2014 de-identification track addressed a broader set of entities and PHI than that covered by the Health Insurance Portability and Accountability Act (HIPAA), which was the focus of the de-identification shared task organized in 2006. Ten teams tackled the 2014 de-identification task and submitted 22 system outputs for evaluation. Each team was evaluated on its best-performing system output. Three of the 10 systems achieved F1 scores over .90, and seven of the 10 scored over .75. The most successful systems combined conditional random fields and hand-written rules. Our findings indicate that automated systems can be very effective for this task, but that de-identification is not yet a solved problem.
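For readers unfamiliar with the F1 scores quoted above, the following is a minimal sketch of how precision, recall, and F1 can be computed for PHI annotations. It assumes strict matching, in which a predicted span counts as correct only if its character offsets and PHI category both match a gold annotation exactly; the abstract does not state the matching criteria used in the shared task. The `Span` type, the function name, and the sample annotations are hypothetical and serve only as illustration.

```python
from typing import NamedTuple, Set, Tuple


class Span(NamedTuple):
    """A PHI annotation: character offsets plus a category label (hypothetical type)."""
    start: int
    end: int
    label: str


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 under strict (exact-match) comparison.

    Assumption: a prediction is a true positive only if its offsets and
    label are identical to those of a gold annotation.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations for one record, for illustration only.
gold = {
    Span(0, 10, "NAME"),
    Span(25, 35, "DATE"),
    Span(50, 58, "HOSPITAL"),
    Span(70, 74, "AGE"),
}
predicted = {
    Span(0, 10, "NAME"),       # exact match
    Span(25, 35, "DATE"),      # exact match
    Span(50, 58, "CITY"),      # right offsets, wrong category: not counted
    Span(90, 95, "PHONE"),     # spurious prediction
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Under strict matching, the span with correct offsets but the wrong category counts both as a false positive and as a missed gold annotation, so a single categorization error lowers precision and recall together.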