Institute for Computational Health Sciences, University of California, San Francisco, 550 16th Street, San Francisco, CA, USA.
Department of Radiology and Biomedical Imaging, University of California, San Francisco, 185 Berry St., San Francisco, CA, 94143-0946, USA.
J Digit Imaging. 2019 Apr;32(2):228-233. doi: 10.1007/s10278-018-0154-z.
Applying state-of-the-art machine learning techniques to medical images requires careful selection and normalization of input data. One such step in digital mammography screening for breast cancer is the labeling and removal of special diagnostic views, in which diagnostic tools or magnification are applied to assist in the assessment of suspicious initial findings. Because a common task in medical informatics is predicting disease and its stage, these special diagnostic views, which are enriched only among diseased cases, will bias machine learning disease predictions. To automate this step, we develop a machine learning pipeline that uses both DICOM headers and images to predict such views automatically, allowing for their removal and the generation of unbiased datasets. We achieve an AUC of 99.72% in predicting special mammogram views when combining both types of models. Finally, we apply these models to clean up a dataset of about 772,000 images with an expected sensitivity of 99.0%. The pipeline presented in this paper can be applied to other datasets to obtain high-quality image sets suitable for training disease detection algorithms.
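The combination and clean-up steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the header-based and image-based models are combined or how the operating point is chosen, so the weighted average, the function names (`combine_scores`, `threshold_for_sensitivity`, `filter_special_views`), and the record layout below are all assumptions made for illustration.

```python
import math


def combine_scores(header_prob, image_prob, weight=0.5):
    """Blend two per-image probabilities of being a special view.

    Assumption: a simple weighted average; the paper may use another rule.
    """
    return weight * header_prob + (1.0 - weight) * image_prob


def threshold_for_sensitivity(scores, labels, target=0.99):
    """Return the largest threshold that flags at least `target` of positives.

    An image is flagged as a special view when its score >= threshold.
    `labels` uses 1 for special views and 0 for standard views.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    # Number of positives that must be flagged to reach the target sensitivity.
    needed = math.ceil(target * len(positives))
    return positives[needed - 1]


def filter_special_views(records, threshold):
    """Keep only records scored below the threshold (predicted standard views).

    Assumption: each record is a dict with a precomputed "score" entry.
    """
    return [r for r in records if r["score"] < threshold]
```

In this sketch the threshold would be chosen on a labeled validation set and then applied unchanged to the unlabeled images. Lowering the threshold raises sensitivity for special views at the cost of discarding more standard views.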