Department of Radiology, Taean-gun Health Center and County Hospital, Taean-gun, Korea.
Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Korea.
Korean J Radiol. 2019 Mar;20(3):405-410. doi: 10.3348/kjr.2019.0025.
To evaluate the design characteristics of studies that evaluated the performance of artificial intelligence (AI) algorithms for the diagnostic analysis of medical images.
PubMed MEDLINE and Embase databases were searched to identify original research articles published between January 1, 2018 and August 17, 2018 that investigated the performance of AI algorithms that analyze medical images to provide diagnostic decisions. Eligible articles were evaluated to determine 1) whether the study performed external validation rather than internal validation only, and, in the case of external validation, whether the validation data were collected 2) with a diagnostic cohort design instead of a diagnostic case-control design, 3) from multiple institutions, and 4) prospectively. These are fundamental methodologic features recommended for the clinical validation of AI performance in real-world practice. Studies that fulfilled the above criteria were identified, the publishing journals were classified into medical vs. non-medical groups, and the results were compared between the two groups.
Of the 516 eligible published studies, only 6% (31 studies) performed external validation. None of these 31 studies adopted all three recommended design features for external validation: diagnostic cohort design, inclusion of multiple institutions, and prospective data collection. No significant difference was found between medical and non-medical journals.
Nearly all of the studies published during the study period that evaluated the performance of AI algorithms for the diagnostic analysis of medical images were designed as proof-of-concept technical feasibility studies and lacked the design features recommended for robust validation of the real-world clinical performance of AI algorithms.