Department of Integrative Oncology, BC Cancer Research Institute, Vancouver, Canada.
Department of Integrative Oncology, BC Cancer Research Institute, Vancouver, Canada; Department of Physics and Astronomy, University of British Columbia, Vancouver, Canada.
J Med Imaging Radiat Sci. 2024 Dec;55(4):101745. doi: 10.1016/j.jmir.2024.101745. Epub 2024 Aug 29.
The reproducibility crisis in AI research remains a significant concern. While code sharing has been acknowledged as a step toward addressing this issue, our focus extends beyond that paradigm. In this work, we explore "federated testing" as an avenue for advancing reproducible AI research and development, especially in medical imaging. Unlike federated learning, in which a model is developed and refined on data from different centers, federated testing involves models developed by one team being deployed and evaluated by others, addressing reproducibility across different implementations.
Our study follows an exploratory design aimed at systematically evaluating the sources of discrepancy that arise when a shared medical imaging model is executed on the same input data at different sites, independent of any generalizability analysis. We distributed the same model code to multiple independent centers and monitored execution across different runtime environments, considering a range of real-world pre- and post-processing scenarios. We also analyzed deployment infrastructure by comparing the impact of different computational resources (GPU vs. CPU) on model performance. The comparative evaluation targeted AI-driven positron emission tomography (PET) image segmentation, with each center applying its own pre- and post-processing steps in its own deployment environment. Specifically, we studied federated testing of an AI model for surrogate total metabolic tumor volume (sTMTV) segmentation in PET imaging: the algorithm, trained on maximum intensity projection (MIP) data, segments lymphoma regions and estimates sTMTV.
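For illustration, a minimal Python sketch of the MIP pre-processing step described above; the array layout, axis choice, and function name are assumptions for exposition, not the study's released code:

    import numpy as np

    def coronal_mip(pet_volume: np.ndarray) -> np.ndarray:
        """Collapse a 3D PET volume (z, y, x) into a 2D coronal
        maximum intensity projection by taking the voxel-wise
        maximum along the anterior-posterior axis."""
        return pet_volume.max(axis=1)

    # Illustrative use: a random stand-in for an SUV-normalized volume.
    volume = np.random.rand(200, 192, 192).astype(np.float32)
    mip = coronal_mip(volume)  # shape (200, 192)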
Our study reveals that relying solely on open-source code sharing does not guarantee reproducible results, owing to variations in code execution, runtime environments, and incomplete input specifications. Whether the segmentation model was deployed on local GPUs, virtual GPUs, or in Docker containers had no effect on reproducibility. However, significant sources of variability were found in data preparation and in the pre- and post-processing techniques applied to the PET images. These findings underscore the limitations of code sharing alone in achieving consistent and accurate results in federated testing.
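To illustrate how a pre-processing discrepancy of this kind can arise, the sketch below resamples the same image under two interpolation settings; the zoom factor and interpolation orders are illustrative assumptions, not the settings used at the participating centers:

    import numpy as np
    from scipy.ndimage import zoom

    # A stand-in PET slice; the values play the role of SUVs.
    rng = np.random.default_rng(0)
    pet_slice = rng.random((64, 64)).astype(np.float32)

    # Two sites resample to the same target grid but with different
    # interpolation: nearest-neighbour at one, cubic spline at the other.
    nearest = zoom(pet_slice, 2.0, order=0)
    cubic = zoom(pet_slice, 2.0, order=3)

    # Identical input and target grid, different voxel values -- enough
    # to flip thresholded voxels near a segmentation boundary.
    print(np.abs(nearest - cubic).max())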
Achieving consistent, accurate results in federated testing requires more than sharing models as open-source code. Comprehensive pipeline sharing, including pre- and post-processing steps, is essential. Cloud-based platforms that automate these processes can streamline AI model testing across diverse locations. Standardizing protocols and sharing complete pipelines can significantly enhance the robustness and reproducibility of AI models.
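As a sketch of what sharing a complete pipeline behind a single entry point could look like, consider the following; the function, the Keras-style model interface, and the pixel-count sTMTV surrogate are hypothetical stand-ins, not the study's implementation:

    import numpy as np

    def run_pipeline(pet_volume: np.ndarray, model) -> dict:
        """Hypothetical single entry point: bundling pre-processing,
        inference, and post-processing behind one call is what keeps
        the pipeline identical across centers."""
        mip = pet_volume.max(axis=1)                   # pre-processing: coronal MIP
        mip = (mip - mip.mean()) / (mip.std() + 1e-8)  # pre-processing: normalization
        prob = model.predict(mip[None, ..., None])     # inference on (1, H, W, 1)
        mask = prob[0, ..., 0] > 0.5                   # post-processing: threshold
        stmtv = float(mask.sum())                      # post-processing: pixel-count surrogate
        return {"mask": mask, "sTMTV_surrogate": stmtv}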