Department of Hematology/Oncology, Fox Chase Cancer Center, Philadelphia, Pennsylvania.
Cancer Prevention and Control Research Program, Fox Chase Cancer Center, Philadelphia, Pennsylvania.
JAMA Netw Open. 2023 Aug 1;6(8):e2328712. doi: 10.1001/jamanetworkopen.2023.28712.
IMPORTANCE: Delays in starting cancer treatment disproportionately affect vulnerable populations and can influence patients' experience and outcomes. Machine learning algorithms incorporating electronic health record (EHR) data and neighborhood-level social determinants of health (SDOH) measures may identify at-risk patients.
OBJECTIVE: To develop and validate a machine learning model for estimating the probability of a treatment delay using multilevel data sources.
DESIGN, SETTING, AND PARTICIPANTS: This cohort study evaluated 4 different machine learning approaches for estimating the likelihood of a treatment delay greater than 60 days (group least absolute shrinkage and selection operator [LASSO], bayesian additive regression tree, gradient boosting, and random forest). Criteria for selecting between approaches were discrimination, calibration, and interpretability/simplicity. The multilevel data set included clinical, demographic, and neighborhood-level census data derived from the EHR, cancer registry, and American Community Survey. Patients with invasive breast, lung, colorectal, bladder, or kidney cancer diagnosed from 2013 to 2019 and treated at a comprehensive cancer center were included. Data analysis was performed from January 2022 to June 2023.
EXPOSURES: Variables included demographics, cancer characteristics, comorbidities, laboratory values, imaging orders, and neighborhood variables.
MAIN OUTCOMES AND MEASURES: The outcome estimated by machine learning models was likelihood of a delay greater than 60 days between cancer diagnosis and treatment initiation. The primary metric used to evaluate model performance was area under the receiver operating characteristic curve (AUC-ROC).
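As a minimal sketch of the outcome and metric described above, the following assumes hypothetical diagnosis and treatment dates and made-up model scores; none of these values come from the study, and the function names are illustrative only. The AUC-ROC is computed in its rank-based (Mann-Whitney) form: the probability that a randomly chosen delayed patient receives a higher score than a randomly chosen non-delayed patient.

```python
from datetime import date


def delayed(diagnosis: date, treatment: date, threshold_days: int = 60) -> int:
    """Return 1 if treatment began more than threshold_days after diagnosis, else 0."""
    return int((treatment - diagnosis).days > threshold_days)


def auc_roc(labels, scores):
    """Rank-based AUC: share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical records: (diagnosis date, treatment start date, model-estimated probability)
records = [
    (date(2019, 1, 1), date(2019, 1, 20), 0.10),  # 19 days
    (date(2019, 1, 1), date(2019, 3, 2), 0.35),   # exactly 60 days, so not "greater than 60"
    (date(2019, 1, 1), date(2019, 3, 3), 0.80),   # 61 days
    (date(2019, 2, 1), date(2019, 5, 1), 0.30),   # 89 days
    (date(2019, 3, 1), date(2019, 3, 15), 0.20),  # 14 days
]

labels = [delayed(dx, tx) for dx, tx, _ in records]
scores = [score for _, _, score in records]
auc = auc_roc(labels, scores)
print(labels, round(auc, 3))
```

Note that the threshold is strict: a patient treated exactly 60 days after diagnosis is not counted as delayed under the "greater than 60 days" definition.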
RESULTS: A total of 6409 patients were included (mean [SD] age, 62.8 [12.5] years; 4321 [67.4%] female; 2576 [40.2%] with breast cancer, 1738 [27.1%] with lung cancer, and 1059 [16.5%] with kidney cancer). A total of 1621 (25.3%) experienced a delay greater than 60 days. The selected group LASSO model had an AUC-ROC of 0.713 (95% CI, 0.679-0.745). Lower likelihood of delay was seen with diagnosis at the treating institution; first malignant neoplasm; Asian or Pacific Islander or White race; private insurance; and lacking comorbidities. Greater likelihood of delay was seen at the extremes of neighborhood deprivation. Model performance (AUC-ROC) was lower in Black patients, patients with race and ethnicity other than non-Hispanic White, and those living in the most disadvantaged neighborhoods. Though the model selected neighborhood SDOH variables as contributing variables, performance was similar when fit with and without these variables.
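The abstract reports an AUC-ROC with a 95% CI and compares performance across patient subgroups, but it does not state how the interval was computed. One common choice, assumed here purely for illustration, is a percentile bootstrap. The sketch below runs on synthetic labels, scores, and a hypothetical two-level grouping variable; it does not reproduce the study's data or results.

```python
import random


def auc_roc(labels, scores):
    """Rank-based AUC: share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC (an assumed method, not confirmed by the source)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lack one of the two classes
            stats.append(auc_roc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def auc_by_group(labels, scores, groups):
    """AUC computed separately within each level of a grouping variable."""
    result = {}
    for g in sorted(set(groups)):
        ys = [y for y, gg in zip(labels, groups) if gg == g]
        ss = [s for s, gg in zip(scores, groups) if gg == g]
        result[g] = auc_roc(ys, ss)
    return result


# Synthetic data: roughly 25% positives, scores shifted upward for positives.
rng = random.Random(42)
labels = [1 if rng.random() < 0.25 else 0 for _ in range(200)]
scores = [rng.gauss(0.8 * y, 1.0) for y in labels]
groups = [rng.choice(["A", "B"]) for _ in labels]  # hypothetical subgroup labels

point = auc_roc(labels, scores)
lo, hi = bootstrap_auc_ci(labels, scores)
by_group = auc_by_group(labels, scores, groups)
print(round(point, 3), (round(lo, 3), round(hi, 3)), by_group)
```

Subgroup AUCs rest on fewer patients than the overall estimate, so their intervals are wider; apparent differences between groups should be read with that in mind.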
CONCLUSIONS AND RELEVANCE: In this cohort study, a machine learning model incorporating EHR and SDOH data was able to estimate the likelihood of delays in starting cancer therapy. Future work should focus on additional ways to incorporate SDOH data to improve model performance, particularly in vulnerable populations.