Wahid Kareem A, Sahin Onur, Kundu Suprateek, Lin Diana, Alanis Anthony, Tehami Salik, Kamel Serageldin, Duke Simon, Sherer Michael V, Rasmussen Mathis, Korreman Stine, Fuentes David, Cislo Michael, Nelms Benjamin E, Christodouleas John P, Murphy James D, Mohamed Abdallah S R, He Renjie, Naser Mohammed A, Gillespie Erin F, Fuller Clifton D
Department of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA.
Department of Imaging Physics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA.
medRxiv. 2023 Sep 5:2023.08.30.23294786. doi: 10.1101/2023.08.30.23294786.
Medical image auto-segmentation is poised to revolutionize radiotherapy workflows. The quality of auto-segmentation training data, primarily derived from clinician observers, is of utmost importance. However, the factors influencing the quality of these clinician-derived segmentations have yet to be fully understood or quantified. Therefore, the purpose of this study was to determine the role of common observer demographic variables on quantitative segmentation performance.
Organ at risk (OAR) and tumor volume segmentations provided by radiation oncologist observers from the Contouring Collaborative for Consensus in Radiation Oncology public dataset were utilized for this study. Segmentations were derived from five separate disease sites, each represented by one patient case: breast, sarcoma, head and neck (H&N), gynecologic (GYN), and gastrointestinal (GI). Segmentation quality was determined on a structure-by-structure basis by comparing the observer segmentations with an expert-derived consensus gold standard, primarily using the Dice Similarity Coefficient (DSC); surface DSC was investigated as a secondary metric. Metrics were stratified into binary groups based on previously established structure-specific expert-derived interobserver variability (IOV) cutoffs. Generalized linear mixed-effects models using Markov chain Monte Carlo Bayesian estimation were used to investigate the association between demographic variables and the binarized segmentation quality for each disease site separately. Variables with a highest density interval excluding zero (loosely analogous to frequentist significance) were considered to substantially impact the outcome measure.
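As an illustration of the primary metric and its binarization, the following is a minimal sketch, not the study's code. It assumes segmentations are available as boolean voxel masks of equal shape; the function names, the toy masks, the cutoff values, and the choice of "greater than or equal to" as the meaning of crossing a cutoff are all assumptions made for the example.

```python
import numpy as np

def dice_similarity(mask_a, mask_b):
    """Dice Similarity Coefficient: 2|A ∩ B| / (|A| + |B|) for boolean masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        # Both masks empty: treated here as perfect agreement (a convention,
        # not something specified in the abstract).
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / total

def crosses_cutoff(dsc, iov_cutoff):
    """Binarize a DSC value against a structure-specific IOV cutoff.

    Assumption: 'crossing' means DSC >= cutoff.
    """
    return bool(dsc >= iov_cutoff)

# Toy 2D example (hypothetical masks, not study data).
observer = np.zeros((4, 4), dtype=bool)
observer[1:3, 1:4] = True      # 6 voxels
consensus = np.zeros((4, 4), dtype=bool)
consensus[1:3, 0:3] = True     # 6 voxels, 4 of which overlap the observer mask

dsc = dice_similarity(observer, consensus)   # 2 * 4 / (6 + 6)
passes_loose = crosses_cutoff(dsc, 0.6)      # hypothetical cutoff
passes_strict = crosses_cutoff(dsc, 0.8)     # hypothetical cutoff
```

The binarized outcome (one value per observer and structure) is the kind of response variable that a mixed-effects logistic model, as described above, would take as input.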
After filtering to include only practicing radiation oncologists, 574, 110, 452, 112, and 48 structure observations remained for the breast, sarcoma, H&N, GYN, and GI cases, respectively. The median percentage of observations that crossed the expert DSC IOV cutoff when stratified by structure type was 55% and 31% for OARs and tumor volumes, respectively. Bayesian regression analysis revealed tumor category had a substantial negative impact on binarized DSC for the breast (coefficient mean ± standard deviation: -0.97 ± 0.20), sarcoma (-1.04 ± 0.54), H&N (-1.00 ± 0.24), and GI (-2.95 ± 0.98) cases. There were no clear recurring relationships between segmentation quality and demographic variables across the cases, with most variables demonstrating large standard deviations and wide highest density intervals.
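The "highest density interval excluding zero" criterion can be sketched as follows. This is an illustrative example only: the 95% mass is an assumption (the abstract does not state the interval mass), the function name is hypothetical, and the samples are synthetic normal draws that merely borrow the reported breast mean and standard deviation for scale; they are not the study's posterior.

```python
import numpy as np

def highest_density_interval(samples, mass=0.95):
    """Narrowest interval containing `mass` of the sorted posterior samples."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.ceil(mass * n))           # number of samples inside the interval
    if k < 1 or k > n:
        raise ValueError("mass must be in (0, 1] and samples non-empty")
    widths = x[k - 1:] - x[: n - k + 1]  # width of every window of k samples
    i = int(np.argmin(widths))           # start of the narrowest window
    return x[i], x[i + k - 1]

# Synthetic stand-in for MCMC draws of one regression coefficient.
rng = np.random.default_rng(0)
draws = rng.normal(loc=-0.97, scale=0.20, size=4000)

lo, hi = highest_density_interval(draws, mass=0.95)
excludes_zero = (lo > 0) or (hi < 0)     # criterion for a "substantial" effect
```

For a roughly symmetric posterior the interval is close to the equal-tailed credible interval; the two differ mainly when the posterior is skewed.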
Our study highlights substantial uncertainty surrounding conventionally presumed factors influencing segmentation quality. Future studies should investigate additional demographic variables, more patients and imaging modalities, and alternative metrics of segmentation acceptability.