M. E. R. Bongers, Q. C. B. S. Thio, A. V. Karhade, M. L. Stor, K. A. Raskin, S. A. Lozano-Calderon, Department of Orthopaedic Surgery, Division of Orthopaedic Oncology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA.
T. F. DeLaney, Department of Radiation Oncology, Massachusetts General Hospital, Boston, MA, USA.
M. L. Ferrone, Department of Orthopaedic Surgery, Orthopaedic Oncology Service, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
J. H. Schwab, Department of Orthopaedic Surgery, Division of Orthopaedic Oncology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA.
Clin Orthop Relat Res. 2019 Oct;477(10):2296-2303. doi: 10.1097/CORR.0000000000000748.
BACKGROUND: We developed a machine learning algorithm to predict the survival of patients with chondrosarcoma. The algorithm demonstrated excellent discrimination and calibration on internal validation in a derivation cohort based on data from the Surveillance, Epidemiology, and End Results (SEER) registry. However, the algorithm has not been validated in an independent external dataset.
QUESTIONS/PURPOSES: Does the Skeletal Oncology Research Group (SORG) algorithm accurately predict 5-year survival in an independent patient population surgically treated for chondrosarcoma?
METHODS: The SORG algorithm was developed using the SEER registry, which contains demographic data, tumor characteristics, treatment, and outcome values and includes approximately 30% of patients with cancer in the United States. The SEER registry was ideal for creating the derivation cohort, and consequently the SORG algorithm, because of the high number of eligible patients and the availability of most explanatory variables of interest. Between 1992 and 2013, 326 patients were treated surgically for extracranial chondrosarcoma of the bone at two tertiary care referral centers. Of those, 179 were accounted for at a minimum of 5 years after diagnosis in a clinical note at one of the two institutions, unless they died earlier, and were included in the validation cohort. The remaining 147 (45%) did not meet the minimum 5 years of followup at the institution and were not included in the validation of the SORG algorithm. Survival at 5 years was checked for all 326 patients in the Social Security Death Index, and all were included in a supplemental validation cohort to also ascertain validity for patients with less than 5 years of institutional followup. The variables the SORG algorithm uses to predict 5-year survival (sex, age, histologic subtype, tumor grade, tumor size, tumor extension, and tumor location) were collected manually from medical records. The tumor characteristics were collected from the postoperative musculoskeletal pathology report. Predicted probabilities of 5-year survival were calculated for each patient in the validation cohort using the SORG algorithm, and performance was then assessed using the same metrics as for internal validation: discrimination, calibration, and overall performance.
Discrimination was calculated using the concordance statistic (equivalent to the area under the receiver operating characteristic [ROC] curve), which measures how well the algorithm distinguishes patients who survived from those who did not and ranges from 0.5 (no better than a coin toss) to 1.0 (perfect discrimination). Calibration was assessed using the calibration slope and intercept from a calibration plot to measure the agreement between predicted and observed outcomes. A perfect calibration plot shows a 45° upward line, corresponding to a slope of 1 and an intercept of 0. Overall performance was determined using the Brier score, which ranges from 0 (perfect prediction) to 1 (worst prediction). The Brier score was compared with the null-model Brier score, which shows the performance of a model that ignores all covariates. A Brier score lower than the null-model Brier score indicates better performance of the algorithm. For the external validation, an F1-score was added to measure the overall accuracy of the algorithm; it ranges from 0 (total failure of an algorithm) to 1 (perfect algorithm).
RESULTS: The 5-year survival was lower in the validation cohort than in the derivation cohort from SEER (61.5% [110 of 179] versus 76% [1131 of 1544]; p < 0.001). This difference was driven by a higher proportion of dedifferentiated chondrosarcoma in the institutional population than in the derivation cohort (27% [49 of 179] versus 9% [131 of 1544]; p < 0.001). Patients in the validation cohort also had larger tumors, higher tumor grades, and more nonextremity tumor locations than did those in the derivation cohort. These differences emphasize that the external validation was performed in a cohort that differed not only in patients but also in disease characteristics. Within the subpopulations of patients with conventional chondrosarcoma and those with dedifferentiated chondrosarcoma, 5-year survival did not differ between the two cohorts.
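The performance metrics described in the methods (concordance statistic, calibration slope and intercept, Brier score, null-model Brier score, and F1-score) can be illustrated with a minimal sketch in plain Python. This is not the authors' code. The function names, the 0.5 classification threshold for the F1-score, the joint estimation of slope and intercept by logistic regression on the logit of the predicted probability, and the toy data are all assumptions made for illustration; the abstract does not state these details.

```python
import math

def c_statistic(y_true, y_prob):
    """Concordance statistic; equals the ROC AUC for a binary outcome."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # Fraction of (survivor, non-survivor) pairs in which the survivor
    # received the higher predicted probability; ties count as one half.
    score = sum(1.0 if a > b else 0.5 if a == b else 0.0
                for a in pos for b in neg)
    return score / (len(pos) * len(neg))

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def null_brier_score(y_true):
    """Brier score of a model that predicts the observed prevalence for all."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence * (1 - prevalence)

def f1_score(y_true, y_prob, threshold=0.5):
    """F1-score after dichotomizing predictions (threshold is an assumption)."""
    pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, h in zip(y_true, pred) if y == 1 and h == 1)
    fp = sum(1 for y, h in zip(y_true, pred) if y == 0 and h == 1)
    fn = sum(1 for y, h in zip(y_true, pred) if y == 1 and h == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def calibration_slope_intercept(y_true, y_prob, iters=50):
    """Fit logit(P(y = 1)) = a + b * logit(p) by Newton-Raphson.

    Returns (slope b, intercept a). Assumes 0 < p < 1 and that the
    outcomes are not perfectly separated by the predictions.
    """
    x = [math.log(p / (1 - p)) for p in y_prob]
    a, b = 0.0, 1.0  # start from perfect calibration
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1 / (1 + math.exp(-(a + b * xi)))
            w = mu * (1 - mu)
            g0 += yi - mu            # gradient with respect to a
            g1 += (yi - mu) * xi     # gradient with respect to b
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return b, a

# Toy data (hypothetical): 1 = alive at 5 years, 0 = died before 5 years.
y = [1, 1, 0, 0]
p = [0.9, 0.6, 0.7, 0.2]
slope, intercept = calibration_slope_intercept(y, p)
print("c-statistic:", c_statistic(y, p))
print("Brier:", brier_score(y, p), "null Brier:", null_brier_score(y))
print("F1:", f1_score(y, p))
print("calibration slope:", slope, "intercept:", intercept)
```

In practice a statistics library would be used instead of hand-written routines; the sketch only makes the definitions concrete.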
The concordance statistic for the validation cohort was 0.87 (95% CI, 0.80-0.91). Evaluation of the algorithm's calibration in the institutional population yielded a calibration slope of 0.97 (95% CI, 0.68-1.3) and a calibration intercept of -0.58 (95% CI, -0.97 to -0.20). For overall performance, the algorithm had a Brier score of 0.152 compared with a null-model Brier score of 0.237, indicating a high level of overall performance. The F1-score was 0.836. In the supplemental validation of all 326 patients, the SORG algorithm had a concordance statistic of 0.89 (95% CI, 0.85-0.93). The calibration slope was 1.13 (95% CI, 0.87-1.39) and the calibration intercept was -0.26 (95% CI, -0.57 to 0.06). The Brier score was 0.11, with a null-model Brier score of 0.19. The F1-score was 0.901.
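As a worked check of the reported numbers, the null-model Brier score for the validation cohort can be reproduced from the 5-year survival figure alone, since a model that ignores all covariates predicts the observed survival proportion for every patient. The sketch below uses only values stated in the abstract; the scaled Brier score at the end is an added illustration and is not a metric reported by the authors.

```python
# Five-year survivors in the validation cohort, as reported: 110 of 179.
survivors, total = 110, 179
prevalence = survivors / total

# A covariate-free model predicts `prevalence` for everyone, so its
# Brier score reduces to prevalence * (1 - prevalence).
null_brier = prevalence * (1 - prevalence)

reported_brier = 0.152  # Brier score of the SORG algorithm, as reported

# Scaled Brier score (illustrative addition, not reported in the abstract):
# the fraction of the null-model error removed by the algorithm.
scaled_brier = 1 - reported_brier / null_brier

print(f"null-model Brier score: {null_brier:.3f}")  # 0.237, matching the abstract
print(f"scaled Brier score:     {scaled_brier:.3f}")
```

The computed null-model value agrees with the reported 0.237, which supports reading the null model as the prevalence-only predictor.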
CONCLUSIONS: On external validation, the SORG algorithm retained good discriminative ability and overall performance but overestimated 5-year survival in patients surgically treated for chondrosarcoma. This internet-based tool can help guide patient counseling and shared decision making.
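The overestimation can be made concrete from the reported calibration slope (0.97) and intercept (-0.58) for the validation cohort. The sketch below assumes the usual logistic recalibration model, logit(observed) = intercept + slope × logit(predicted); the abstract does not state how the two values were estimated, so the mapping and the function name `implied_observed` are illustrative assumptions, and the point estimates ignore the reported confidence intervals.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def expit(z):
    return 1 / (1 + math.exp(-z))

# Calibration point estimates reported for the validation cohort.
SLOPE, INTERCEPT = 0.97, -0.58

def implied_observed(predicted):
    """Observed survival probability implied by the recalibration model
    (assumed form: logit(observed) = INTERCEPT + SLOPE * logit(predicted))."""
    return expit(INTERCEPT + SLOPE * logit(predicted))

for predicted in (0.5, 0.7, 0.9):
    print(f"predicted {predicted:.2f} -> implied observed "
          f"{implied_observed(predicted):.2f}")
```

Under this assumed model, each predicted probability maps to a lower implied observed probability, which is what a negative calibration intercept with a slope near 1 indicates.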
LEVEL OF EVIDENCE: Level III, prognostic study.