Li Jieyu, Udupa Jayaram K, Tong Yubing, Wang Lisheng, Torigian Drew A
Institute of Image Processing and Pattern Recognition, Department of Automation, Shanghai Jiao Tong University, 800 Dongchuan RD, Shanghai, 200240, China; Medical Image Processing Group, Department of Radiology, University of Pennsylvania, 602 Goddard building, 3710 Hamilton Walk, Philadelphia, PA, 19104, United States.
Medical Image Processing Group, Department of Radiology, University of Pennsylvania, 602 Goddard building, 3710 Hamilton Walk, Philadelphia, PA, 19104, United States.
Med Image Anal. 2021 Apr;69:101980. doi: 10.1016/j.media.2021.101980. Epub 2021 Jan 26.
Fully annotated data sets play important roles in medical image segmentation and evaluation. Expense and imprecision are the two main issues in generating ground truth (GT) segmentations. In this paper, in an attempt to overcome these two issues jointly, we propose a method, named SparseGT, which exploits variability among human segmenters to maximally reduce the manual workload of GT generation for evaluating actual segmentations by algorithms. Pseudo ground truth (p-GT) segmentations are created with only a small fraction of the workload and with human-level perfection/imperfection, and they can be used in practice as a substitute for fully manual GT in evaluating segmentation algorithms at the same precision. p-GT segmentations are generated by first selecting slices sparsely, conducting manual contouring only on these sparse slices, and subsequently filling in segmentations on the other slices automatically. By creating p-GT with different levels of sparseness, we determine the largest workload reduction achievable for each considered object, where the variability of the generated p-GT is statistically indistinguishable from inter-segmenter differences in full manual GT segmentations for that object. Furthermore, we investigate the segmentation evaluation errors introduced by variability in manual GT by applying p-GT in evaluation of actual segmentations by an algorithm. Experiments are conducted on ∼500 computed tomography (CT) studies involving six objects in two body regions, Head & Neck and Thorax, where optimal sparseness and corresponding evaluation errors are determined for each object and each strategy. Our results indicate that creating p-GT by the concatenated strategy of uniformly selecting sparse slices and filling segmentations via a deep-learning (DL) network shows the highest manual workload reduction, ∼80-96%, without sacrificing evaluation accuracy compared to fully manual GT.
Nevertheless, other strategies also make clear contributions in different situations. A non-uniform strategy for slice selection shows its advantage for objects whose shape changes irregularly from slice to slice. An interpolation strategy for filling segmentations can achieve ∼60-90% workload reduction in simulating human-level GT without the need for an actual training stage and shows potential for enlarging data sets for training p-GT generation networks. We conclude that not only is a workload reduction of over 90% feasible without sacrificing evaluation accuracy, but the suitable strategy and the optimal sparseness level achievable for creating p-GT are also object- and application-specific.
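The uniform slice-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the choice to always keep the first and last slice, and the assumption that manual workload is proportional to the number of contoured slices are all hypothetical. The nearest-slice lookup is a deliberately crude stand-in for the filling step and is neither the interpolation nor the DL strategy evaluated in the paper.

```python
from typing import List


def select_uniform_slices(num_slices: int, step: int) -> List[int]:
    """Pick every `step`-th slice index for manual contouring.

    Assumption (not from the paper): the first and last slice of the
    object are always contoured so that every other slice lies between
    two annotated ones.
    """
    if num_slices <= 0 or step <= 0:
        raise ValueError("num_slices and step must be positive")
    selected = list(range(0, num_slices, step))
    if selected[-1] != num_slices - 1:
        selected.append(num_slices - 1)
    return selected


def workload_reduction(num_slices: int, selected: List[int]) -> float:
    """Fraction of manual work saved, assuming workload scales with the
    number of manually contoured slices (a simplifying assumption)."""
    return 1.0 - len(selected) / num_slices


def nearest_annotated(num_slices: int, selected: List[int]) -> List[int]:
    """For each slice, return the index of the nearest contoured slice.

    Stand-in for the filling step only; ties go to the lower index.
    """
    return [min(selected, key=lambda s: (abs(s - z), s))
            for z in range(num_slices)]


if __name__ == "__main__":
    chosen = select_uniform_slices(num_slices=50, step=10)
    print("contoured slices:", chosen)
    print("workload reduction:", round(workload_reduction(50, chosen), 2))
    print("source slice for slice 45:", nearest_annotated(50, chosen)[45])
```

Under these assumptions, contouring 6 of 50 slices corresponds to an 88% reduction. Whether p-GT at that sparseness stays within inter-segmenter variability is the question the paper answers per object, and this sketch does not address it.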