Department of Orthopaedic Surgery, National Taiwan University Hospital, Taipei, Taiwan.
Department of Medical Education, National Taiwan University Hospital, Taipei, Taiwan.
Clin Orthop Relat Res. 2024 Jan 1;482(1):143-157. doi: 10.1097/CORR.0000000000002706. Epub 2023 Jun 12.
The Skeletal Oncology Research Group machine-learning algorithm (SORG-MLA) was developed to predict the survival of patients with spinal metastasis. The algorithm was successfully tested in five international institutions using 1101 patients from different continents. The incorporation of 18 prognostic factors strengthens its predictive ability but limits its clinical utility because some prognostic factors might not be clinically available when a clinician wishes to make a prediction.
QUESTIONS/PURPOSES: We performed this study to (1) evaluate the SORG-MLA's performance with missing data and (2) develop an internet-based application to impute the missing data.
A total of 2768 patients were included in this study. The data of 617 patients who were treated surgically were intentionally erased, and the data of the other 2151 patients who were treated with radiotherapy and medical treatment were used to impute the artificially missing data. Compared with those who were treated nonsurgically, patients undergoing surgery were younger (median 59 years [IQR 51 to 67 years] versus median 62 years [IQR 53 to 71 years]) and had a higher proportion of patients with at least three spinal metastatic levels (77% [474 of 617] versus 72% [1547 of 2151]), more neurologic deficit (normal American Spinal Injury Association [E] 68% [301 of 443] versus 79% [1227 of 1561]), higher BMI (23 kg/m² [IQR 20 to 25 kg/m²] versus 22 kg/m² [IQR 20 to 25 kg/m²]), higher platelet count (240 × 10³/µL [IQR 173 to 327 × 10³/µL] versus 227 × 10³/µL [IQR 165 to 302 × 10³/µL]), higher lymphocyte count (15 × 10³/µL [IQR 9 to 21 × 10³/µL] versus 14 × 10³/µL [IQR 8 to 21 × 10³/µL]), lower serum creatinine level (0.7 mg/dL [IQR 0.6 to 0.9 mg/dL] versus 0.8 mg/dL [IQR 0.6 to 1.0 mg/dL]), less previous systemic therapy (19% [115 of 617] versus 24% [526 of 2151]), fewer Charlson comorbidities other than cancer (28% [170 of 617] versus 36% [770 of 2151]), and longer median survival. The two patient groups did not differ in other regards. These findings aligned with our institutional philosophy of selecting patients for surgical intervention based on their level of favorable prognostic factors such as BMI or lymphocyte counts and lower levels of unfavorable prognostic factors such as white blood cell counts or serum creatinine level, as well as the degree of spinal instability and severity of neurologic deficits. This approach aims to identify patients with better survival outcomes and prioritize their surgical intervention accordingly.
Seven factors (serum albumin and alkaline phosphatase levels, international normalized ratio, lymphocyte and neutrophil counts, and the presence of visceral or brain metastases) were considered possible missing items based on five previous validation studies and clinical experience. Artificially missing data were imputed using the missForest imputation technique, which was previously applied and successfully tested to fit the SORG-MLA in validation studies. Discrimination, calibration, overall performance, and decision curve analysis were applied to evaluate the SORG-MLA's performance. The discrimination ability was measured with an area under the receiver operating characteristic curve. It ranges from 0.5 to 1.0, with 0.5 indicating the worst discrimination and 1.0 indicating perfect discrimination. An area under the curve of 0.7 is considered clinically acceptable discrimination. Calibration refers to the agreement between the predicted outcomes and actual outcomes. An ideal calibration model will yield predicted survival rates that are congruent with the observed survival rates. The Brier score measures the squared difference between the actual outcome and predicted probability, which captures calibration and discrimination ability simultaneously. A Brier score of 0 indicates perfect prediction, whereas a Brier score of 1 indicates the poorest prediction. A decision curve analysis was performed for the 6-week, 90-day, and 1-year prediction models to evaluate their net benefit across different threshold probabilities. Using the results from our analysis, we developed an internet-based application that facilitates real-time data imputation for clinical decision-making at the point of care. This tool allows healthcare professionals to efficiently and effectively address missing data, ensuring that patient care remains optimal at all times.
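The three performance measures described above can be illustrated with a minimal sketch. It is not the study's analysis code: the function names, the toy outcome vector, and the predicted probabilities are all hypothetical, and the outcome is assumed to be coded 1 when the event of interest occurred and 0 otherwise.

```python
# Illustrative only: hypothetical toy data, not values from the study.
# y_true: 1 = event occurred, 0 = event did not occur (assumed coding).
# y_prob: predicted probability of the event for each patient.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]


def auc(y_true, y_prob):
    """Area under the ROC curve, computed as the fraction of
    (event, non-event) pairs in which the event case received the
    higher predicted probability; ties count as half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between outcome (0 or 1) and predicted
    probability: 0 is perfect, 1 is the poorest possible."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true, y_prob, threshold):
    """Net benefit at one threshold probability, the quantity plotted
    on a decision curve: true positives minus false positives weighted
    by the odds of the threshold, per patient."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * (threshold / (1 - threshold))


print("AUC:", auc(y_true, y_prob))
print("Brier score:", brier_score(y_true, y_prob))
print("Net benefit at 0.5:", net_benefit(y_true, y_prob, 0.5))
```

A decision curve would repeat the net benefit calculation over a range of threshold probabilities and compare the model against treating all patients or none.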
Generally, the SORG-MLA demonstrated good discriminatory ability, with areas under the curve greater than 0.7 in most cases, and good overall performance, with up to 25% improvement in Brier scores in the presence of one to three missing items. The only exceptions were albumin level and lymphocyte count: the SORG-MLA's performance was reduced when these two items were missing, indicating that the SORG-MLA might be unreliable without these values. The model tended to underestimate the patient survival rate. As the number of missing items increased, the model's discriminatory ability was progressively impaired, and a marked underestimation of patient survival rates was observed. Specifically, when three items were missing, the number of actual survivors was up to 1.3 times greater than the number of expected survivors, whereas only a 10% discrepancy was observed when one item was missing. When either two or three items were omitted, the decision curves exhibited substantial overlap, indicating a lack of consistent disparities in performance. This finding suggests that the SORG-MLA consistently generates accurate predictions, regardless of which two or three items are omitted. We developed an internet application ( https://sorg-spine-mets-missing-data-imputation.azurewebsites.net/ ) that allows the use of SORG-MLA with up to three missing items.
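The comparison of actual with expected survivors can be read as an observed-to-expected ratio. The sketch below shows the arithmetic on hypothetical numbers chosen to reproduce a ratio of 1.3. It assumes the model output has been expressed as a predicted survival probability per patient; the function name and data are illustrative and not taken from the study.

```python
def observed_expected_ratio(survived, predicted_survival):
    """Ratio of observed survivors to the number of survivors the model
    expects (the sum of predicted survival probabilities). A value
    above 1 means the model underestimates survival."""
    observed = sum(survived)
    expected = sum(predicted_survival)
    return observed / expected


# Hypothetical cohort of 20 patients: 13 actually survived, while the
# model assigned each patient a 0.5 probability of surviving.
survived = [1] * 13 + [0] * 7
predicted_survival = [0.5] * 20

ratio = observed_expected_ratio(survived, predicted_survival)
print("Observed/expected survivors:", ratio)
```

Under this reading, the 10% discrepancy reported for a single missing item would correspond to a ratio of about 1.1.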
The SORG-MLA generally performed well in the presence of one to three missing items, except when serum albumin level or lymphocyte count was missing (these items are essential for adequate predictions, even using our modified version of the SORG-MLA). We recommend that future studies develop prediction models that can be used when data are missing, or provide a means to impute those missing data, because some data are not available at the time a clinical decision must be made.
The results suggest that the algorithm could be helpful when a radiologic evaluation cannot be performed in time owing to a lengthy waiting period, especially in situations in which an early operation could be beneficial. It could help orthopaedic surgeons decide whether to intervene palliatively or extensively, even when the surgical indication is clear.