Andersen Eline Sandvig, Birk-Korch Johan Baden, Hansen Rasmus Søgaard, Fly Line Haugaard, Röttger Richard, Arcani Diana Maria Cespedes, Brasen Claus Lohman, Brandslund Ivan, Madsen Jonna Skov
Department of Biochemistry and Immunology, Lillebaelt Hospital - University Hospital of Southern Denmark, Vejle, Denmark.
Department of Regional Health Research, University of Southern Denmark, Lillebælt Hospital (Kolding and Vejle), Denmark.
JBI Evid Synth. 2024 Dec 1;22(12):2423-2446. doi: 10.11124/JBIES-24-00042.
The objective of this review was to provide an overview of the diverse methods described, tested, or implemented for monitoring performance of clinical artificial intelligence (AI) systems, while also summarizing the arguments given for or against these methods.
The integration of AI in clinical decision-making is steadily growing. The performance of AI systems changes over time, necessitating ongoing performance monitoring. However, the evidence on specific monitoring methods is sparse and heterogeneous. Thus, an overview of the evidence on this topic is warranted to guide further research on clinical AI monitoring.
We included publications detailing metrics or statistical processes employed in systematic, continuous, or repeated initiatives aimed at evaluating or predicting the clinical performance of AI models with direct implications for patient management in health care. No limitations on language or publication date were enforced.
We performed systematic database searches in MEDLINE (Ovid), Embase (Ovid), Scopus, and ProQuest Dissertations and Theses Global, supplemented by backward and forward citation searches and gray literature searches. Two or more independent reviewers conducted title and abstract screening, full-text evaluation, and data extraction using a tool developed by the authors. During extraction, the methods identified were divided into subcategories. The results are presented narratively and summarized in tables and graphs.
Thirty-nine sources of evidence were included in the review, with the most abundant source types being opinion papers/narrative reviews (33%) and simulation studies (33%). One guideline on the topic was identified, offering limited guidance on specific metrics and statistical methods. The number of included sources increased year by year, with almost 4 times as many sources from 2023 as from 2019. The most commonly reported performance metrics were traditional metrics from the medical literature, including area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, and predictive values, although few arguments were given supporting these choices. Some studies reported on metrics and statistical processing specifically designed to monitor clinical AI.
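As an illustration only, the traditional metrics named above could be recomputed for each monitoring period roughly as in the following sketch. It is not taken from any of the included sources. The function names, the window labels, the 0.5 decision threshold, and the example data are all hypothetical assumptions, and AUROC is computed here through its rank-based (Mann-Whitney) interpretation rather than by integrating a curve.

```python
from typing import Dict, List, Sequence, Tuple


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, and predictive values from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive case scores higher than a random
    negative case, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


Window = Tuple[str, Sequence[int], Sequence[float]]


def monitor(windows: Sequence[Window], threshold: float = 0.5) -> List[Dict[str, object]]:
    """Recompute the metrics once per monitoring window.

    The threshold is an assumed operating point, not a recommended value.
    """
    report: List[Dict[str, object]] = []
    for name, y_true, scores in windows:
        y_pred = [1 if s >= threshold else 0 for s in scores]
        row: Dict[str, object] = {"window": name, "auroc": auroc(y_true, scores)}
        row.update(confusion_metrics(y_true, y_pred))
        report.append(row)
    return report


if __name__ == "__main__":
    # Hypothetical data: observed outcomes and model scores for two periods.
    example = [
        ("period-1", [1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]),
        ("period-2", [1, 1, 0, 0, 0, 0], [0.7, 0.3, 0.6, 0.4, 0.2, 0.1]),
    ]
    for row in monitor(example):
        print(row)
```

Comparing such rows across periods is one simple way to make a change in performance visible. Which metric to track, and how large a change should prompt action, are exactly the choices for which the review found few supporting arguments.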
This review provides a summary of the methods described for monitoring AI in health care. It reveals a relative scarcity of evidence and guidance for specific practical implementation of performance monitoring of clinical AI. This underscores the imperative for further research, discussion, and guidance regarding the specifics of implementing monitoring for clinical AI. The steady increase in the number of relevant sources published per year suggests that this area of research is gaining increased focus, and the amount of evidence and guidance available will likely increase significantly over the coming years.
Review registration: Open Science Framework, https://osf.io/afkrn.