Ensemble Approaches to Recognize Protected Health Information in Radiology Reports.
Affiliations
Department of Bioengineering, University of Pennsylvania, Philadelphia, PA, USA.
Department of Radiology, University of Pennsylvania, Philadelphia, PA, USA.
Publication information
J Digit Imaging. 2022 Dec;35(6):1694-1698. doi: 10.1007/s10278-022-00673-0. Epub 2022 Jun 17.
Natural language processing (NLP) techniques for electronic health records (EHRs) have shown great potential to improve the quality of medical care. The text of radiology reports frequently constitutes a large fraction of EHR data and can provide valuable information about patients' diagnoses, medical history, and imaging findings. The lack of a major public repository of radiology reports severely limits the development, testing, and application of new NLP tools. De-identification of protected health information (PHI) presents a major challenge to building such repositories: many automated de-identification tools were trained or designed for clinical notes and do not perform well enough to support a public database of radiology reports. We developed and evaluated six ensemble models based on three publicly available de-identification tools: MIT de-id, NeuroNER, and Philter. A set of 1023 reports was set aside as the testing partition. Two individuals with medical training annotated the test set for PHI; differences were resolved by consensus. Ensemble methods included simple voting schemes (1-Vote, 2-Votes, and 3-Votes), a decision tree, a naïve Bayesian classifier, and AdaBoost boosting. The 1-Vote ensemble achieved recall of 998 / 1043 (95.7%); the 3-Votes ensemble had precision of 1035 / 1043 (99.2%). F1 scores were 93.4% for the decision tree, 71.2% for the naïve Bayesian classifier, and 87.5% for the boosting method. Basic voting algorithms and machine learning classifiers that incorporate the predictions of multiple tools can outperform each tool acting alone in de-identifying radiology reports. Ensemble methods hold substantial potential to improve automated de-identification of radiology reports, making such reports more available for research use to improve patient care and outcomes.
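The voting schemes named in the abstract can be illustrated with a minimal sketch. It assumes each tool emits one binary PHI flag per token, which the abstract does not specify; the dictionary keys, the toy labels, and the helper names `vote_ensemble` and `precision_recall_f1` are hypothetical and are not taken from the paper or from the tools' actual interfaces.

```python
from typing import Dict, List, Tuple


def vote_ensemble(predictions: Dict[str, List[int]], min_votes: int) -> List[int]:
    """Flag a token as PHI when at least `min_votes` tools flag it."""
    columns = list(predictions.values())
    n_tokens = len(columns[0])
    if any(len(col) != n_tokens for col in columns):
        raise ValueError("all tools must label the same tokens")
    # zip(*columns) yields, for each token, the tuple of per-tool flags.
    return [int(sum(votes) >= min_votes) for votes in zip(*columns)]


def precision_recall_f1(gold: List[int], predicted: List[int]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for binary PHI labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (invented for illustration): 1 = token flagged as PHI.
tool_outputs = {
    "mit_deid": [1, 0, 0, 1, 0, 0],
    "neuroner": [1, 1, 0, 0, 0, 0],
    "philter":  [1, 1, 0, 1, 1, 0],
}
gold = [1, 1, 0, 1, 0, 0]

one_vote = vote_ensemble(tool_outputs, min_votes=1)     # union of the tools
two_votes = vote_ensemble(tool_outputs, min_votes=2)    # majority
three_votes = vote_ensemble(tool_outputs, min_votes=3)  # unanimous

for name, pred in [("1-Vote", one_vote), ("2-Votes", two_votes), ("3-Votes", three_votes)]:
    p, r, f = precision_recall_f1(gold, pred)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

On this toy input the 1-Vote rule catches every PHI token at the cost of one false positive, while the 3-Votes rule makes no false positives but misses two PHI tokens. This mirrors the trade-off in the abstract, where 1-Vote is reported for recall and 3-Votes for precision. The learned ensembles (decision tree, naïve Bayes, AdaBoost) would presumably take the same per-tool flags as input features, but that is an assumption rather than something the abstract states.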