Department of Community & Family Medicine, Norris Cotton Cancer Center, Lebanon, NH, USA.
Acad Radiol. 2013 Jun;20(6):731-9. doi: 10.1016/j.acra.2013.01.012.
Test sets for assessing and improving radiologic image interpretation have been used for decades and typically evaluate performance relative to gold standard interpretations by experts. To assess test sets for screening mammography, a gold standard for whether a woman should be recalled for additional workup is needed, given that interval cancers may be occult on mammography and some findings ultimately determined to be benign require additional imaging to determine if biopsy is warranted. Using experts to set a gold standard assumes little variation occurs in their interpretations, but this has not been explicitly studied in mammography.
Using digitized films from 314 screening mammography exams (n = 143 cancer cases) performed in the Breast Cancer Surveillance Consortium, we evaluated interpretive agreement among three expert radiologists who independently assessed whether each examination should be recalled and, for recalled images, the lesion location, finding type (mass, calcification, asymmetric density, or architectural distortion), and interpretive difficulty.
Agreement among the three expert pairs for recall/no recall was higher for cancer cases (mean 74.3 ± 6.5) than for noncancers (mean 62.6 ± 7.1). Complete agreement on recall, lesion location, finding type and difficulty ranged from 36.4% to 42.0% for cancer cases and from 43.9% to 65.6% for noncancer cases. Two of three experts agreed on recall and lesion location for 95.1% of cancer cases and 91.8% of noncancer cases, but all three experts agreed on only 55.2% of cancer cases and 42.1% of noncancer cases.
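The agreement measures reported above (pairwise agreement averaged over the three expert pairs, agreement by at least two of three experts, and agreement by all three) can be illustrated with a minimal sketch. The reader names, the five example interpretations, and the use of the sample standard deviation are all assumptions made for illustration; they are not data or methods taken from the study.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, stdev

# Hypothetical interpretations: "none" means no recall; any other label
# means recall at that (made-up) lesion location. Not study data.
ratings = {
    "reader_A": ["R:UOQ", "none", "R:UOQ", "L:LIQ", "none"],
    "reader_B": ["R:UOQ", "none", "R:UIQ", "none", "none"],
    "reader_C": ["R:UOQ", "L:UOQ", "R:LOQ", "L:LIQ", "none"],
}

def pairwise_agreement(ratings):
    """Percent of exams on which each pair of readers gave the same label."""
    result = {}
    for a, b in combinations(sorted(ratings), 2):
        matches = sum(x == y for x, y in zip(ratings[a], ratings[b]))
        result[(a, b)] = 100.0 * matches / len(ratings[a])
    return result

def group_agreement(ratings):
    """Percent of exams where at least two readers, and all readers, agree."""
    per_exam = list(zip(*ratings.values()))
    n_readers = len(ratings)
    # Size of the largest group of identical labels for each exam.
    top_counts = [Counter(votes).most_common(1)[0][1] for votes in per_exam]
    at_least_two = 100.0 * sum(c >= 2 for c in top_counts) / len(per_exam)
    unanimous = 100.0 * sum(c == n_readers for c in top_counts) / len(per_exam)
    return at_least_two, unanimous

pairs = pairwise_agreement(ratings)
pair_mean = mean(pairs.values())
pair_sd = stdev(pairs.values())  # sample SD; an assumption about the "±" value
two_of_three, all_three = group_agreement(ratings)

print(f"pairwise agreement: mean {pair_mean:.1f} ± {pair_sd:.1f}")
print(f"at least two agree: {two_of_three:.1f}%, all three agree: {all_three:.1f}%")
```

In this toy example, majority agreement is much higher than unanimous agreement, which is the same pattern the results describe for recall and lesion location.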
Variability in expert interpretation is notable. A minimum of three independent experts combined with a consensus should be used for establishing any gold standard interpretation for test sets, especially for noncancer cases.