Mayo Clinic Alix School of Medicine, Mayo Clinic, Rochester, MN, USA.
Division of Nephrology and Hypertension, Mayo Clinic, Rochester, MN, USA.
J Digit Imaging. 2023 Aug;36(4):1770-1781. doi: 10.1007/s10278-023-00804-1. Epub 2023 Mar 17.
The aim of this study was to investigate the use of an exponential-plateau model to determine the training dataset size required to reach maximum medical image segmentation performance. CT and MR images of patients with renal tumors acquired between 1997 and 2017 were retrospectively collected from our nephrectomy registry. Modality-based datasets of 50, 100, 150, 200, 250, and 300 images were assembled to train models with an 80-20 training-validation split, and each model was evaluated against 50 randomly held-out test images. A third experiment, using the KiTS21 dataset, explored the effects of different model architectures. Exponential-plateau models were used to establish the relationship between dataset size and model generalizability. For segmenting non-neoplastic kidney regions on CT and MR imaging, our model yielded test Dice score plateaus of [Formula: see text] and [Formula: see text], with 54 and 122 training-validation images, respectively, needed to reach the plateaus. For segmenting CT and MR tumor regions, we modeled test Dice score plateaus of [Formula: see text] and [Formula: see text], with 125 and 389 training-validation images, respectively, needed to reach the plateaus. For the KiTS21 dataset, the best Dice score plateaus for the nn-UNet 2D and 3D architectures were [Formula: see text] and [Formula: see text], with 177 and 440 images, respectively, needed to reach the performance plateau. Our results show that imaging modality, target structure, and model architecture all affect the number of training images required to reach a performance plateau. The modeling approach we developed will help future researchers determine, for their own experiments, when additional training-validation images are unlikely to further improve model performance.
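As an illustration of the kind of fit described above, the following is a minimal sketch, not the authors' implementation. It assumes one common exponential-plateau form, score = plateau - amplitude * exp(-rate * size), and it assumes that "reaching the plateau" means coming within a chosen tolerance of the fitted plateau; the abstract specifies neither the exact parameterization nor the criterion. The function names, the rate search range, and the tolerance are hypothetical, and the example scores are synthetic values generated from the assumed curve rather than results from the study.

```python
import math


def fit_exponential_plateau(sizes, scores, rate_grid=None):
    """Least-squares fit of score = plateau - amplitude * exp(-rate * size).

    For a fixed rate the model is linear in (plateau, amplitude), so that part
    is solved in closed form while the rate is scanned over a grid.
    """
    if rate_grid is None:
        # Hypothetical search range for the rate: 1e-4 to 1 per image, log-spaced.
        rate_grid = [10 ** (-4 + 4 * i / 2000) for i in range(2001)]
    n = len(sizes)
    y_mean = sum(scores) / n
    best = None
    for rate in rate_grid:
        z = [math.exp(-rate * x) for x in sizes]
        z_mean = sum(z) / n
        szz = sum((zi - z_mean) ** 2 for zi in z)
        if szz == 0:
            continue  # regressor is constant; this rate cannot be fitted
        szy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, scores))
        slope = szy / szz                  # slope equals -amplitude
        plateau = y_mean - slope * z_mean  # intercept equals the plateau
        sse = sum((plateau + slope * zi - yi) ** 2 for zi, yi in zip(z, scores))
        if best is None or sse < best[0]:
            best = (sse, plateau, -slope, rate)
    if best is None:
        raise ValueError("no rate in the grid produced a usable fit")
    _, plateau, amplitude, rate = best
    return plateau, amplitude, rate


def images_to_reach(plateau, amplitude, rate, tolerance=0.01):
    """Smallest size at which the fitted curve is within `tolerance` of the plateau.

    The tolerance criterion is an assumption made for this sketch.
    """
    if amplitude <= tolerance:
        return 0
    return math.ceil(math.log(amplitude / tolerance) / rate)


if __name__ == "__main__":
    # Synthetic, noise-free Dice scores from an assumed curve (not study data).
    sizes = [50, 100, 150, 200, 250, 300]
    scores = [0.90 - 0.30 * math.exp(-0.02 * s) for s in sizes]
    plateau, amplitude, rate = fit_exponential_plateau(sizes, scores)
    print(f"plateau={plateau:.3f} amplitude={amplitude:.3f} rate={rate:.4f}")
    print("images to reach plateau:", images_to_reach(plateau, amplitude, rate))
```

Scanning the rate on a grid keeps the sketch dependency-free; with a numerical library available, a general nonlinear least-squares routine would be the more usual choice, and it would also provide uncertainty estimates for the plateau.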