Huang Ziyan, Deng Zhongying, Ye Jin, Wang Haoyu, Su Yanzhou, Li Tianbin, Sun Hui, Cheng Junlong, Chen Jianpin, He Junjun, Gu Yun, Zhang Shaoting, Gu Lixu, Qiao Yu
Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, 200240, China; School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China; Shanghai Artificial Intelligence Laboratory, Shanghai, 200000, China.
Shanghai Artificial Intelligence Laboratory, Shanghai, 200000, China; Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, CB2 1TN, United Kingdom.
Med Image Anal. 2025 Apr;101:103499. doi: 10.1016/j.media.2025.103499. Epub 2025 Feb 14.
Although deep learning has revolutionized abdominal multi-organ segmentation, its models often struggle with generalization because they are trained on small-scale, modality-specific datasets. The recent emergence of large-scale datasets may mitigate this issue, but some important questions remain unanswered: Can models trained on these large datasets generalize well across different datasets and imaging modalities? In either case, how can we further improve their generalizability? To address these questions, we introduce A-Eval, a benchmark for the cross-dataset and cross-modality Evaluation ('Eval') of Abdominal ('A') multi-organ segmentation, integrating seven datasets across CT and MRI modalities. Our evaluations indicate that significant domain gaps persist despite larger data scales. While larger training sets improve generalization, model performance on unseen data remains inconsistent. Joint training across multiple datasets and modalities enhances generalization further, though annotation inconsistencies pose challenges. These findings highlight the need for diverse, well-curated training data spanning various clinical scenarios and modalities to develop robust medical imaging models. The code and pre-trained models are available at https://github.com/uni-medical/A-Eval.
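Cross-dataset evaluation of multi-organ segmentation is typically scored with a per-organ Dice coefficient between predicted and ground-truth label maps. As a minimal illustrative sketch (not the A-Eval codebase itself; the function name and label convention are assumptions), the metric can be computed as:

```python
import numpy as np

def per_organ_dice(pred, gt, labels):
    """Per-organ Dice score for integer-labelled segmentation masks.

    pred, gt: arrays of the same shape, where each voxel holds an
    integer organ label (0 = background by convention).
    labels: iterable of organ labels to score.
    """
    scores = {}
    for lab in labels:
        p = (pred == lab)
        g = (gt == lab)
        denom = p.sum() + g.sum()
        # Convention: if the organ is absent from both masks, score 1.0.
        scores[lab] = 2.0 * np.logical_and(p, g).sum() / denom if denom else 1.0
    return scores
```

Averaging such per-organ scores over every case in a dataset the model was not trained on gives the cross-dataset generalization measure the abstract discusses.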