Medical Research Scholars Program Fellow, Artificial Intelligence Resource, Molecular Imaging Branch, National Cancer Institute, National Institutes of Health, Bethesda, Maryland.
Staff Scientist, Artificial Intelligence Resource, Molecular Imaging Branch, National Cancer Institute, National Institutes of Health, Bethesda, Maryland.
J Am Coll Radiol. 2023 Feb;20(2):134-145. doi: 10.1016/j.jacr.2022.05.022. Epub 2022 Jul 31.
To determine the rigor, generalizability, and reproducibility of published classification and detection artificial intelligence (AI) models for prostate cancer (PCa) on MRI using the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) guidelines, a 42-item checklist that is considered a measure of best practice for presenting and reviewing medical imaging AI research.
This review searched the English-language literature for studies proposing AI models for PCa detection and classification on MRI. Each study was evaluated against the CLAIM checklist. Additional data sought included measures of AI model performance (eg, area under the curve [AUC], sensitivity, specificity, free-response receiver operating characteristic curves); training, validation, and testing group sample sizes; AI approach; detection versus classification task; public data set utilization; MRI sequences used; and definition of the gold standard for ground truth. The percentage of CLAIM checklist fulfillment was used to stratify studies into quartiles. The Wilcoxon rank-sum test was used for pair-wise comparisons.
In all, 75 studies were identified, and 53 qualified for analysis. The original CLAIM items that most studies did not fulfill include item 12 (77% no), de-identification methods; item 13 (68% no), handling of missing data; item 15 (47% no), rationale for choosing the ground truth reference standard; item 18 (55% no), measurements of inter- and intrareader variability; item 31 (60% no), inclusion of validated interpretability maps; and item 37 (92% no), inclusion of failure analysis to elucidate AI model weaknesses. Comparison of mean AUC across percentage CLAIM fulfillment quartiles revealed significant differences between quartile 1 and quartile 2 (0.78 versus 0.86, P = .034) and between quartile 1 and quartile 4 (0.78 versus 0.89, P = .003). Based on the additional information and outcome metrics gathered in this study, additional measures of best practice are defined. These new items include disclosure of public data set usage, definition of ground truth in comparison with other referenced works on the defined task, and sample size power calculation.
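The quartile stratification and pair-wise comparison described above can be sketched as follows. This is an illustrative reconstruction, not the authors' analysis code; the CLAIM fulfillment percentages and AUC values below are synthetic placeholders, and the Wilcoxon rank-sum test is taken from SciPy.

```python
# Illustrative sketch: stratify studies into quartiles by CLAIM
# fulfillment percentage, then compare AUC between quartiles with a
# Wilcoxon rank-sum test. All data here are synthetic placeholders.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
n_studies = 53
claim_pct = rng.uniform(40, 95, size=n_studies)  # hypothetical fulfillment %
auc = 0.6 + 0.003 * claim_pct + rng.normal(0, 0.03, size=n_studies)

# Assign quartile labels 1-4 from the empirical quartile edges
edges = np.quantile(claim_pct, [0.25, 0.5, 0.75])
quartile = np.searchsorted(edges, claim_pct) + 1

# Pair-wise comparison of the lowest versus highest quartile
q1, q4 = auc[quartile == 1], auc[quartile == 4]
stat, p = ranksums(q1, q4)
print(f"Q1 mean AUC={q1.mean():.2f}, Q4 mean AUC={q4.mean():.2f}, P={p:.3f}")
```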
A large proportion of AI studies do not fulfill key items in the CLAIM guidelines within their methods and results sections. The percentage of CLAIM checklist fulfillment is weakly associated with improved AI model performance. Additions or supplements to CLAIM are recommended to improve publishing standards and aid reviewers in determining study rigor.