Sharma Vaibhav, Barnett Alina Jade, Yang Julia, Cheon Sangwook, Kim Giyoung, Schwartz Fides Regina, Wang Avivah, Hall Neal, Grimm Lars, Chen Chaofan, Lo Joseph Y, Rudin Cynthia
Duke University, Department of Computer Science, Durham, North Carolina, United States.
Duke University School of Medicine, Department of Radiology, Durham, North Carolina, United States.
J Med Imaging (Bellingham). 2025 May;12(3):035501. doi: 10.1117/1.JMI.12.3.035501. Epub 2025 May 21.
Breast cancer remains a leading cause of death for women, and screening programs are deployed to detect cancer at early stages. One barrier currently identified by breast imaging researchers is a shortage of labeled image datasets; addressing it is crucial for improving early-detection models. We present an active learning (AL) framework for segmenting breast masses from 2D digital mammography, and we release the resulting labeled data. Our method aims to reduce the input needed from expert annotators to reach a fully labeled dataset.
We create a dataset of 1136 mammographic masses with pixel-wise binary segmentation labels; the test subset is labeled independently by two different teams. With this dataset, we simulate a human annotator within an AL framework to develop and compare AI-assisted labeling methods, using a discriminator model and a simulated oracle to collect acceptable segmentation labels. A U-Net model is retrained on these labels to generate new segmentations. We evaluate various oracle heuristics by the percentage of segmentations that the oracle relabels, and we measure the quality of the proposed labels by computing the intersection over union (IoU) on a validation dataset.
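The abstract describes the labeling loop only at a high level, so the following is a minimal Python sketch of one AI-assisted labeling round under that description: a segmentation model proposes masks, a discriminator decides which proposals are acceptable, the remainder are relabeled by a (here simulated) oracle, and label quality is scored with IoU. All function names, the acceptance threshold, and the dummy model/discriminator classes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of one AI-assisted labeling round (not the authors' code).
import numpy as np


def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection over union for two binary masks."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return float(inter) / float(union) if union > 0 else 1.0


def labeling_round(model, discriminator, oracle_masks, images, accept_threshold=0.9):
    """Propose a mask per image, keep proposals the discriminator accepts,
    and send the rest to the (simulated) oracle for relabeling.
    Returns the collected labels and the fraction the oracle had to relabel."""
    labels, relabeled = [], 0
    for idx, image in enumerate(images):
        proposed = model.predict(image)                     # segmentation proposal
        if discriminator.score(image, proposed) >= accept_threshold:
            labels.append(proposed)                         # accepted as-is
        else:
            labels.append(oracle_masks[idx])                # oracle relabels
            relabeled += 1
    return labels, relabeled / len(images)                  # labels feed retraining


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = [rng.random((8, 8)) for _ in range(4)]
    truth = [img > 0.5 for img in images]                   # stand-in oracle labels

    class DummyModel:                                       # stand-in for the U-Net
        def predict(self, img):
            return img > 0.6

    class DummyDiscriminator:                               # stand-in discriminator
        def score(self, img, mask):
            return float(mask.mean() > 0.1)

    labels, oracle_fraction = labeling_round(DummyModel(), DummyDiscriminator(), truth, images)
    print("fraction relabeled by oracle:", oracle_fraction)
    print("mean IoU vs. ground truth:", np.mean([iou(p, t) for p, t in zip(labels, truth)]))
```

In this sketch the oracle-relabeled fraction plays the role of the expert-input metric, and the mean IoU on held-out masks plays the role of the validation-set label quality measure mentioned in the abstract.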
Our method reduces expert annotator input by 44%. We present a dataset of 1136 binary segmentation labels approved by board-certified radiologists and make the 143-image validation set public for comparison with other researchers' methods.
We demonstrate that AL can significantly improve the efficiency and time-effectiveness of creating labeled mammogram datasets. Our framework facilitates the development of high-quality datasets while minimizing manual effort in the domain of digital mammography.