Prabhudesai Snehal, Wang Nicholas Chandler, Ahluwalia Vinayak, Huan Xun, Bapuraj Jayapalli Rajiv, Banovic Nikola, Rao Arvind
Computer Science and Engineering, University of Michigan, Ann Arbor, MI, United States.
Computational Medicine and Bioinformatics, Michigan Medicine, Ann Arbor, MI, United States.
Front Neurosci. 2021 Oct 6;15:740353. doi: 10.3389/fnins.2021.740353. eCollection 2021.
Accurate and consistent segmentation plays an important role in the diagnosis, treatment planning, and monitoring of both High Grade Glioma (HGG), including Glioblastoma Multiforme (GBM), and Low Grade Glioma (LGG). Segmentation accuracy can be affected by the imaging presentation of glioma, which varies greatly between the two tumor grade groups. In recent years, researchers have used Machine Learning (ML) to segment tumors more rapidly and consistently than manual segmentation. However, existing ML validation relies heavily on computing summary statistics and rarely tests the generalizability of an algorithm on clinically heterogeneous data. In this work, our goal is to investigate how to holistically evaluate the performance of ML algorithms on a brain tumor segmentation task. We address the need for rigorous evaluation of ML algorithms and present four axes of model evaluation: diagnostic performance, model confidence, robustness, and data quality. We perform a comprehensive evaluation of a glioma segmentation ML algorithm by stratifying data by tumor grade group (GBM and LGG) and evaluating the algorithm on each of the four axes. The main takeaways of our work are: (1) ML algorithms need to be evaluated on out-of-distribution data to assess generalizability, reflective of tumor heterogeneity. (2) Segmentation metrics alone are limited in their ability to evaluate the errors made by ML algorithms and describe their consequences. (3) Adopting tools from other domains, such as robustness testing (adversarial attacks) and model uncertainty (prediction intervals), leads to a more comprehensive performance evaluation. Such a holistic evaluation framework could shed light on an algorithm's clinical utility and help it evolve into a more clinically valuable tool.
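As a minimal sketch of takeaway (2) above, the Dice similarity coefficient (a standard segmentation summary statistic) can assign identical scores to errors with very different clinical consequences. The masks and function below are illustrative, not from the paper:

```python
def dice_coefficient(pred, truth):
    """Dice = 2|P ∩ T| / (|P| + |T|) for binary masks given as flat lists."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    return 2.0 * intersection / total if total else 1.0

# Hypothetical 1-D "masks": two different error patterns, same summary score.
truth  = [1, 1, 1, 1, 0, 0, 0, 0]
pred_a = [1, 1, 1, 0, 0, 0, 0, 0]  # under-segments the tumor boundary
pred_b = [1, 0, 1, 1, 0, 0, 0, 0]  # leaves a hole inside the tumor

# Both predictions miss exactly one true voxel, so Dice = 6/7 for each,
# even though a boundary miss and an interior hole differ clinically.
print(dice_coefficient(truth, pred_a))
print(dice_coefficient(truth, pred_b))
```

This is the sense in which overlap metrics summarize *how much* is wrong but not *where* or *why*, motivating the additional evaluation axes (confidence, robustness, data quality) the abstract proposes.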