Department of Radiology, Duke University, Durham, NC, USA.
School of Medicine, Duke University, Durham, NC, USA.
J Imaging Inform Med. 2024 Oct;37(5):1-7. doi: 10.1007/s10278-024-01098-7. Epub 2024 Apr 8.
De-identification of DICOM images is an essential component of medical image research. While many established methods exist for the safe removal of protected health information (PHI) from DICOM metadata, approaches for removing PHI "burned in" to image pixel data are typically manual, and automated high-throughput approaches are not well validated. Emerging optical character recognition (OCR) models can potentially detect and remove PHI-bearing text from medical images, but they are very time-consuming to run on the high volume of images found in typical research studies. We present a data processing method that performs metadata de-identification on all images and applies OCR only to images with a high likelihood of containing burned-in text. The method was validated on a dataset of 415,182 images across ten modalities, representative of the de-identification requests submitted at our institution over a 20-year span. Of the 12,578 images in this dataset with burned-in text of any kind, only 10 went undetected by the method. OCR was required for only 6,050 images (1.5% of the dataset).
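The reported counts can be turned into rates with a few lines of arithmetic. The following is a minimal sketch that uses only the four figures stated in the abstract; the variable names and the `fraction` helper are illustrative assumptions and are not part of the published method.

```python
# Counts taken directly from the abstract.
total_images = 415_182      # images in the validation dataset
images_with_text = 12_578   # images containing burned-in text of any kind
missed = 10                 # burned-in-text images that went undetected
ocr_images = 6_050          # images that required OCR


def fraction(part: int, whole: int) -> float:
    """Return part / whole, rejecting a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


# Share of burned-in-text images that the method detected.
detection_rate = fraction(images_with_text - missed, images_with_text)
# Share of burned-in-text images that the method missed.
miss_rate = fraction(missed, images_with_text)
# Share of the whole dataset that had to be sent through OCR.
ocr_fraction = fraction(ocr_images, total_images)
# Share of the whole dataset that contained burned-in text at all.
text_prevalence = fraction(images_with_text, total_images)

print(f"Detection rate:  {detection_rate:.2%}")
print(f"Miss rate:       {miss_rate:.2%}")
print(f"OCR fraction:    {ocr_fraction:.2%}")
print(f"Text prevalence: {text_prevalence:.2%}")
```

The OCR fraction computed this way rounds to the 1.5% stated in the abstract. The detection and miss rates are derived here from the reported counts and are not quoted in the source.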