Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital, Boston, Massachusetts.
American College of Radiology, Reston, Virginia.
J Am Coll Radiol. 2020 Dec;17(12):1653-1662. doi: 10.1016/j.jacr.2020.05.015. Epub 2020 Jun 24.
We developed deep learning algorithms to automatically assess BI-RADS breast density.
Using a large multi-institution cohort of 108,230 digital screening mammograms from the Digital Mammographic Imaging Screening Trial, we investigated the effect of data, model, and training parameters on overall model performance and obtained a crowdsourced evaluation from attendees of the ACR 2019 Annual Meeting.
Our best-performing algorithm achieved good agreement with radiologists who were qualified interpreters of mammograms, with a four-class κ of 0.667. When training used images randomly sampled from the data set rather than an equal number of images from each density category, the model predictions were biased away from low-prevalence categories such as extremely dense breasts. As a result, equal-class sampling yielded higher sensitivity and lower specificity for predicting dense breasts than random sampling did. We also found that model performance degraded when we evaluated on digital mammography data formats that differed from the one used for training, which emphasizes the importance of multi-institutional training sets. Lastly, we showed that crowdsourced annotations, including those from attendees who routinely read mammograms, had higher agreement with our algorithm than with the original interpreting radiologists.
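The quantities reported above can be illustrated with a minimal sketch, which is not the authors' code. It assumes an unweighted Cohen's κ (the abstract does not state whether the κ was weighted), assumes that "dense" means BI-RADS categories c and d, and uses sampling with replacement as one possible way to draw an equal number of images per category. All function names and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

CATEGORIES = ["a", "b", "c", "d"]  # BI-RADS density categories
DENSE = {"c", "d"}                 # assumption: c and d count as "dense"


def cohen_kappa(rater1, rater2, categories=CATEGORIES):
    """Unweighted Cohen's kappa between two equal-length label lists.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(rater1)
    observed = sum(x == y for x, y in zip(rater1, rater2)) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    # Chance agreement: product of the two raters' marginal frequencies.
    expected = sum(counts1[k] * counts2[k] for k in categories) / (n * n)
    return (observed - expected) / (1 - expected)


def equal_class_sample(labels, per_class, seed=0):
    """Return indices with exactly `per_class` draws from each category.

    Draws are made with replacement (an assumption), so low-prevalence
    categories can be oversampled.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    picked = []
    for label in sorted(by_class):
        picked.extend(rng.choices(by_class[label], k=per_class))
    return picked


def dense_sensitivity_specificity(truth, pred):
    """Sensitivity and specificity for the binary dense vs. non-dense task."""
    pairs = list(zip(truth, pred))
    tp = sum(t in DENSE and p in DENSE for t, p in pairs)
    fn = sum(t in DENSE and p not in DENSE for t, p in pairs)
    tn = sum(t not in DENSE and p not in DENSE for t, p in pairs)
    fp = sum(t not in DENSE and p in DENSE for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the study.
    radiologist = ["a", "b", "b", "c", "c", "d", "b", "c"]
    model = ["a", "b", "c", "c", "c", "c", "b", "b"]
    print(cohen_kappa(radiologist, model))
    print(dense_sensitivity_specificity(radiologist, model))
    print(Counter(radiologist[i] for i in equal_class_sample(radiologist, 5)))
```

Under these assumptions, a model trained on randomly sampled images sees few extremely dense examples, whereas `equal_class_sample` presents every category equally often, which is consistent with the shift toward higher sensitivity and lower specificity for dense breasts described above.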
We identified parameters that can influence model performance and showed how crowdsourcing can be used for evaluation. This study was performed in tandem with the development of the ACR AI-LAB, a platform for democratizing artificial intelligence.