CGD Health Pvt. Ltd. Hyderabad, Telangana, India.
School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, China.
Stud Health Technol Inform. 2024 Aug 22;316:719-723. doi: 10.3233/SHTI240515.
Automatic deidentification of Electronic Health Records (EHR) is a crucial step in their secondary use for biomedical research. This study evaluates a hybrid deidentification strategy, the OpenDeID pipeline, across diverse corpora to assess how well it protects sensitive information in EHR datasets. Three corpora were used: the OpenDeID v2 corpus of pathology reports from Australian hospitals, the 2014 i2b2/UTHealth deidentification corpus of clinical narratives from the USA, and the 2016 CEGS N-GRID deidentification corpus of psychiatric notes. The OpenDeID pipeline combines deep learning with contextual rules. Pre-processing involved harmonizing the corpora and resolving encoding and format issues. Performance was assessed with precision, recall, and F-measure. The Discharge Summary BioBERT model performed best. Trained on the three corpora, which together contain 4,038 reports, it achieved micro-averaged F1-scores of 0.9248 and 0.9692 under the strict and relaxed settings, respectively. These results indicate the model's efficacy and its potential role in safeguarding patient privacy in secondary use of EHR.
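To make the reported metrics concrete, the following is a minimal sketch of micro-averaged precision, recall, and F1 under a strict and a relaxed matching setting. It is not the OpenDeID evaluation code. The `Span` type, the function names, and the example offsets are hypothetical, and the relaxed setting is approximated here as "same label with any character overlap", which may differ from the exact definition used in the study.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """Hypothetical annotation record: one sensitive entity in one document."""
    doc_id: str
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # e.g. "NAME", "DATE", "ID"


def spans_match(pred: Span, gold: Span, strict: bool) -> bool:
    """Strict: identical offsets and label. Relaxed (assumed): same label, any overlap."""
    if pred.doc_id != gold.doc_id or pred.label != gold.label:
        return False
    if strict:
        return pred.start == gold.start and pred.end == gold.end
    return pred.start < gold.end and gold.start < pred.end


def micro_prf(gold: List[Span], pred: List[Span], strict: bool = True) -> Tuple[float, float, float]:
    """Pool true/false positives over all documents and labels, then compute P, R, F1."""
    unmatched_gold = list(gold)
    true_pos = 0
    for p in pred:
        for i, g in enumerate(unmatched_gold):
            if spans_match(p, g, strict):
                true_pos += 1
                del unmatched_gold[i]  # each gold span can be matched only once
                break
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (invented for illustration, not taken from any corpus).
gold_spans = [
    Span("d1", 0, 10, "NAME"),
    Span("d1", 20, 30, "DATE"),
    Span("d2", 5, 9, "ID"),
]
pred_spans = [
    Span("d1", 0, 10, "NAME"),   # exact match
    Span("d1", 22, 30, "DATE"),  # partial overlap: counts only in the relaxed setting
    Span("d2", 40, 44, "ID"),    # no overlap: false positive in both settings
]

strict_scores = micro_prf(gold_spans, pred_spans, strict=True)
relaxed_scores = micro_prf(gold_spans, pred_spans, strict=False)
print("strict  P/R/F1:", strict_scores)
print("relaxed P/R/F1:", relaxed_scores)
```

Because the counts are pooled before the ratios are taken, frequent entity types dominate a micro-averaged score. The relaxed score can never fall below the strict one, which is consistent with the gap between the two figures reported above.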