Section of Medical Physics and Engineering, Kanagawa Cancer Center, Yokohama, Japan.
Department of Radiation Oncology, Saitama Medical Center, Saitama Medical University, Kawagoe, Japan.
Med Phys. 2022 Jan;49(1):727-741. doi: 10.1002/mp.15393. Epub 2021 Dec 13.
The purpose of this study is to evaluate how well different machine learning models predict and classify the gamma passing rate (GPR), and to select the best model for achieving machine learning-based patient-specific quality assurance (PSQA).
Measurement verification of 356 head-and-neck volumetric modulated arc therapy plans was performed using a diode array phantom (Delta4 Phantom), and GPR values at 2%/2 mm with global normalization and 3%/2 mm with local normalization were calculated. Machine learning models, including ridge regression (RIDGE), random forest (RF), support vector regression (SVR), and stacked generalization (STACKING), were used to predict the GPR. Each machine learning model was trained using 260 plans, and the prediction accuracy was evaluated using the remaining 96 plans. The prediction error between the measured and predicted GPR was evaluated. For the classification evaluation, the lower control limit for the measured GPR and the lower control limit for the predicted GPR (LCL) were defined to classify each GPR value as a "pass" or a "fail." LCL values with 99% and 99.9% confidence levels were calculated as the upper prediction limits for the GPR estimated from the linear regression between the measured and predicted GPR.
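The model comparison described above can be sketched with scikit-learn. This is a minimal illustration, not the authors' pipeline: the features, the synthetic GPR values, every hyperparameter, the ridge meta-learner, and the simple first-260/last-96 split are all assumptions made for the example. Only the four model families and the 260/96 plan counts come from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 356 "plans", 10 hypothetical plan features.
rng = np.random.default_rng(0)
n_plans, n_features = 356, 10
X = rng.normal(size=(n_plans, n_features))
# Hypothetical GPR-like target in percent, capped at 100.
gpr = np.clip(
    97.0 + X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=n_plans),
    None,
    100.0,
)

# 260 plans for training and the remaining 96 for evaluation, as in the text.
X_train, X_test = X[:260], X[260:]
y_train, y_test = gpr[:260], gpr[260:]

# Base learners; all hyperparameters are illustrative assumptions.
base_models = [
    ("ridge", make_pipeline(StandardScaler(), Ridge(alpha=1.0))),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("svr", make_pipeline(StandardScaler(), SVR(C=10.0, epsilon=0.1))),
]

# Stacked generalization: a meta-learner (here ridge, an assumption) is fit on
# cross-validated predictions of the base learners.
stacking = StackingRegressor(
    estimators=base_models,
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stacking.fit(X_train, y_train)

# Prediction error = predicted GPR minus "measured" GPR on the held-out plans.
predicted = stacking.predict(X_test)
errors = predicted - y_test
max_abs_error = float(np.max(np.abs(errors)))
print(f"max |prediction error| on 96 held-out plans: {max_abs_error:.2f}%")
```

Each base learner can also be fit on its own with the same split to compare its maximum prediction error against the stacked model.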
The models tended to overestimate low measured GPR values. The maximum prediction errors for RIDGE, RF, SVR, and STACKING were 3.2%, 2.9%, 2.3%, and 2.2% at the global 2%/2 mm and 6.3%, 6.6%, 6.1%, and 5.5% at the local 3%/2 mm, respectively. At the global 2%/2 mm, the sensitivity was 100% for all the machine learning models except RIDGE when using the 99% LCL. The specificity was 76.1% for RIDGE, RF, and SVR and 66.3% for STACKING; however, the specificity decreased dramatically when the 99.9% LCL was used. At the local 3%/2 mm, only STACKING showed 100% sensitivity when using the 99% LCL. The decrease in the specificity using the 99.9% LCL was smaller than that at the global 2%/2 mm, and the specificity for RIDGE, RF, SVR, and STACKING was 61.3%, 61.3%, 72.0%, and 66.8%, respectively.
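The sensitivity and specificity figures can be read as follows, under the assumption that "fail" is the positive class: a plan truly fails when its measured GPR is below the measured-GPR limit, and it is flagged when its predicted GPR is below the predicted-GPR limit. The sketch below uses made-up GPR values and thresholds, and the function and argument names are hypothetical.

```python
def sensitivity_specificity(measured, predicted, lcl_measured, lcl_predicted):
    """Return (sensitivity, specificity), treating "fail" as the positive class."""
    tp = fn = tn = fp = 0
    for m, p in zip(measured, predicted):
        truly_fail = m < lcl_measured      # ground truth from the measurement
        flagged_fail = p < lcl_predicted   # decision from the model prediction
        if truly_fail and flagged_fail:
            tp += 1
        elif truly_fail:
            fn += 1
        elif flagged_fail:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Illustrative values only (not from the study); low GPRs are overestimated.
measured = [88.0, 91.5, 95.0, 97.2, 98.4, 99.1]
predicted = [92.0, 93.0, 95.5, 96.0, 98.0, 98.8]

# A low predicted-GPR limit misses one failing plan.
print(sensitivity_specificity(measured, predicted, 92.0, 92.5))  # (0.5, 1.0)
# A higher limit catches every failing plan but flags two passing plans.
print(sensitivity_specificity(measured, predicted, 92.0, 96.5))  # (1.0, 0.5)
```

Raising the predicted-GPR limit trades specificity for sensitivity, which matches the direction of the change reported when moving from the 99% to the 99.9% limit.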
STACKING had better prediction accuracy for low GPR values than other machine learning models. Applying LCL to a regression model enabled the consistent evaluation of quantitative and qualitative GPR predictions. Adjusting the confidence level of the LCL helped improve the balance between the sensitivity and specificity. We suggest that STACKING can assist the safe and efficient operation of PSQA.
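One way to see how the confidence level moves the predicted-GPR limit is the sketch below. It makes several assumptions that the abstract does not confirm: the predicted GPR is regressed on the measured GPR by ordinary least squares, the limit is the one-sided upper prediction limit evaluated at the measured-GPR control limit, and that control limit is 92% (a hypothetical value). The data are synthetic.

```python
import numpy as np
from scipy import stats


def upper_prediction_limit(measured, predicted, lcl_measured, confidence):
    """One-sided upper prediction limit of predicted GPR at measured = lcl_measured."""
    x = np.asarray(measured, dtype=float)
    y = np.asarray(predicted, dtype=float)
    n = x.size
    x_mean = x.mean()
    sxx = np.sum((x - x_mean) ** 2)

    # Ordinary least-squares fit: predicted = intercept + slope * measured.
    slope = np.sum((x - x_mean) * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x_mean

    # Residual standard error with n - 2 degrees of freedom.
    residuals = y - (intercept + slope * x)
    s = np.sqrt(np.sum(residuals ** 2) / (n - 2))

    # Standard error for a new observation at x0 = lcl_measured.
    se_pred = s * np.sqrt(1.0 + 1.0 / n + (lcl_measured - x_mean) ** 2 / sxx)
    t_crit = stats.t.ppf(confidence, df=n - 2)  # one-sided quantile (assumption)
    return float(intercept + slope * lcl_measured + t_crit * se_pred)


# Synthetic data (assumption): 96 plans whose low GPRs are overestimated.
rng = np.random.default_rng(1)
measured = rng.uniform(88.0, 100.0, size=96)
predicted = 40.0 + 0.6 * measured + rng.normal(scale=0.8, size=96)

lcl_99 = upper_prediction_limit(measured, predicted, lcl_measured=92.0, confidence=0.99)
lcl_999 = upper_prediction_limit(measured, predicted, lcl_measured=92.0, confidence=0.999)
print(f"99% limit: {lcl_99:.2f}%, 99.9% limit: {lcl_999:.2f}%")
```

A higher confidence level yields a higher limit, so more plans fall below it and are flagged; under these assumptions that favors sensitivity at the expense of specificity.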