Department of Physics and Astronomy, University of Calgary, 2500 University Dr NW, Calgary, Alberta, T2N 1N4, Canada. Department of Medical Physics, Tom Baker Cancer Centre, 1331 29 St NW, Calgary, Alberta, T2N 4N2, Canada. Author to whom any correspondence should be addressed.
Phys Med Biol. 2020 Mar 6;65(5):055014. doi: 10.1088/1361-6560/ab6e54.
Algorithm benchmarking and characterization are an important part of algorithm development and validation prior to clinical implementation. However, benchmarking may be limited to a small collection of test cases because establishing 'ground-truth' references is resource-intensive. This study proposes a framework for selecting test cases to assess algorithm and workflow equivalence. Effective test case selection may minimize the number of ground-truth comparisons required to establish robust and clinically relevant benchmarking and characterization results. To demonstrate the proposed framework, we clustered differences between two independent workflows estimating during-treatment dose objective violations for 15 head and neck cancer patients (15 planning CTs, 105 on-unit CBCTs). Each workflow used a different deformable image registration algorithm to estimate inter-fractional anatomy and contour changes. The Hopkins statistic tested whether workflow output was inherently clustered, and k-medoid clustering formalized cluster assignment. Further statistical analyses verified the relevance of clusters to algorithm output. Data at cluster centers ('medoids') were considered candidate test cases representative of workflow-relevant algorithm differences. The framework indicated that differences in estimated dose objective violations were naturally grouped (Hopkins = 0.75, providing 90% confidence). K-medoid clustering identified five clusters that stratified workflow differences (MANOVA: p < 0.001) in estimated parotid gland D50%, spinal cord/brainstem Dmax, and high-dose CTV coverage dose violations (Kendall's tau: p < 0.05). The systematic algorithm differences driving these workflow discrepancies were, respectively: parotid gland volumes (ANOVA: p < 0.001), external contour deformations (t-test: p = 0.022), and CTV-to-PTV margins (t-test: p = 0.009). Five candidate test cases were verified as representative of the five clusters.
The framework successfully clustered workflow outputs and identified five test cases representative of clinically relevant algorithm discrepancies. This approach may improve the allocation of resources during the benchmarking and characterization process and the applicability of results to clinical data.
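The statistical core of the framework, testing for clustering tendency with the Hopkins statistic and then selecting cluster centers ('medoids') as candidate test cases, can be sketched in Python. This is a minimal NumPy-only illustration, not the authors' implementation; the synthetic data, sample sizes, and the simple alternating (PAM-style) k-medoids update are assumptions for demonstration.

```python
import numpy as np

def hopkins(X, m=None, seed=None):
    """Hopkins statistic for clustering tendency.
    Values near 1 suggest clustered data; ~0.5 suggests spatial randomness."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m if m is not None else max(1, n // 10)
    # u: nearest-data-point distances from m uniform points in the bounding box
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = np.linalg.norm(U[:, None, :] - X[None, :, :], axis=2).min(axis=1)
    # w: nearest-neighbour distances from m sampled data points (self excluded)
    idx = rng.choice(n, size=m, replace=False)
    D = np.linalg.norm(X[idx][:, None, :] - X[None, :, :], axis=2)
    D[np.arange(m), idx] = np.inf  # mask each sampled point's zero self-distance
    w = D.min(axis=1)
    return u.sum() / (u.sum() + w.sum())

def k_medoids(X, k, n_iter=100, seed=None):
    """Simple alternating k-medoids; returns medoid indices and cluster labels."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)  # assign points to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size:  # new medoid: member minimizing in-cluster distance sum
                new_medoids[c] = members[D[np.ix_(members, members)].sum(axis=0).argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, D[:, medoids].argmin(axis=1)

# Synthetic, clearly clustered data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(5, 0.1, (30, 2))])
print(hopkins(X, seed=1))               # near 1: data show clustering tendency
medoids, labels = k_medoids(X, k=2, seed=1)
# The rows X[medoids] would serve as the candidate test cases.
```

In the study's setting, the rows of `X` would instead be per-fraction workflow-difference features (e.g. differences in estimated dose objective violations), and the medoids returned would nominate the representative test cases for ground-truth comparison.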