Ensemble Approaches to Recognize Protected Health Information in Radiology Reports.

Affiliations

Department of Bioengineering, University of Pennsylvania, Philadelphia, PA, USA.

Department of Radiology, University of Pennsylvania, Philadelphia, PA, USA.

Publication information

J Digit Imaging. 2022 Dec;35(6):1694-1698. doi: 10.1007/s10278-022-00673-0. Epub 2022 Jun 17.

DOI:10.1007/s10278-022-00673-0

PMID:35715655

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9712864/

Abstract

Natural language processing (NLP) techniques for electronic health records (EHR) have shown great potential to improve the quality of medical care. The text of radiology reports frequently constitutes a large fraction of EHR data, and can provide valuable information about patients' diagnoses, medical history, and imaging findings. The lack of a major public repository for radiology reports severely limits the development, testing, and application of new NLP tools. De-identification of protected health information (PHI) presents a major challenge to building such repositories, as many automated tools for de-identification were trained or designed for clinical notes and do not perform sufficiently well to build a public database of radiology reports. We developed and evaluated six ensemble models based on three publicly available de-identification tools: MIT de-id, NeuroNER, and Philter. A set of 1023 reports was set aside as the testing partition. Two individuals with medical training annotated the test set for PHI; differences were resolved by consensus. Ensemble methods included simple voting schemes (1-Vote, 2-Votes, and 3-Votes), a decision tree, a naïve Bayesian classifier, and AdaBoost. The 1-Vote ensemble achieved recall of 998 / 1043 (95.7%); the 3-Votes ensemble had precision of 1035 / 1043 (99.2%). F1 scores were: 93.4% for the decision tree, 71.2% for the naïve Bayesian classifier, and 87.5% for the boosting method. Basic voting algorithms and machine learning classifiers incorporating the predictions of multiple tools can outperform each tool acting alone in de-identifying radiology reports. Ensemble methods hold substantial potential to improve automated de-identification tools for radiology reports to make such reports more available for research use to improve patient care and outcomes.
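
The voting schemes named in the abstract can be illustrated with a minimal sketch. It assumes each tool emits a binary PHI flag per token and that a k-Votes ensemble flags a token when at least k tools agree; the abstract does not describe the unit of analysis or the implementation, so the function names, the `tool_a`/`tool_b`/`tool_c` labels, and the toy data below are all hypothetical and are not results from the study.

```python
from typing import Dict, List


def vote_ensemble(predictions: List[List[int]], min_votes: int) -> List[int]:
    """Flag a token as PHI (1) when at least `min_votes` tools flag it."""
    return [int(sum(votes) >= min_votes) for votes in zip(*predictions)]


def precision_recall_f1(gold: List[int], pred: List[int]) -> Dict[str, float]:
    """Compute precision, recall, and F1 for binary PHI labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy, made-up token-level labels (assumption: 1 = PHI, 0 = not PHI).
gold = [1, 1, 0, 0, 1, 0]
tool_a = [1, 0, 0, 1, 1, 0]  # hypothetical stand-in for one tool's output
tool_b = [1, 1, 0, 0, 0, 0]  # hypothetical stand-in for a second tool
tool_c = [1, 0, 1, 0, 1, 0]  # hypothetical stand-in for a third tool

for k in (1, 2, 3):
    pred = vote_ensemble([tool_a, tool_b, tool_c], min_votes=k)
    scores = precision_recall_f1(gold, pred)
    print(f"{k}-Vote(s): pred={pred} "
          f"precision={scores['precision']:.2f} recall={scores['recall']:.2f}")
```

On this toy data, lowering the vote threshold raises recall at the cost of precision, and raising it does the opposite, which is the same trade-off the abstract reports for the 1-Vote and 3-Votes ensembles.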
