Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA.
Department of Medical Ethics and Health Policy, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA.
JCO Clin Cancer Inform. 2021 Sep;5:1015-1023. doi: 10.1200/CCI.21.00077.
Machine learning models developed from electronic health record data are increasingly used to predict the risk of mortality for general oncology patients. However, these models may perform suboptimally because of patient heterogeneity. The objective of this work is to develop a new modeling approach for predicting short-term mortality that accounts for heterogeneity across multiple subgroups in the presence of a large number of electronic health record predictors.
We proposed a two-stage approach to addressing heterogeneity among oncology patients of different cancer types when predicting their risk of mortality. Structured data were extracted from the University of Pennsylvania Health System for 20,723 patients with 11 cancer types, of whom 1,340 (6.5%) died. We first modeled the overall risk for all patients without differentiating cancer types, as is done in current practice. We then developed cancer type-specific models using the overall risk score as a predictor along with preselected type-specific predictors. The overall and type-specific models were compared with respect to discrimination, using the area under the precision-recall curve (AUPRC), and calibration, using the calibration slope. We also proposed metrics that characterize the degree of risk heterogeneity by comparing risk predictors in the overall and type-specific models.
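The two-stage idea described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract does not state which learners were used, so plain logistic regression fitted by gradient descent stands in for both stages, the overall risk score is passed to stage 2 on the logit scale (an assumption), and the data, the cancer types shown, and all names (`fit_logistic`, `two_stage_risk`, `shared`, `specific`) are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def logit(p, eps=1e-6):
    # Log-odds of a probability, clipped away from 0 and 1.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def fit_logistic(X, y, lr=0.1, epochs=500):
    # Batch gradient descent; returns [intercept, coef_1, ..., coef_p].
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

# Synthetic cohort (hypothetical): each row is (cancer type, shared
# predictors, type-specific predictor, died). The type-specific predictor
# only matters for leukemia, which is the heterogeneity being modeled.
random.seed(0)
patients = []
for _ in range(400):
    ctype = random.choice(["leukemia", "lung"])
    shared = [random.gauss(0, 1), random.gauss(0, 1)]
    specific = [random.gauss(0, 1)]
    effect = 1.5 if ctype == "leukemia" else 0.0
    p_death = sigmoid(-2.0 + 0.8 * shared[0] + effect * specific[0])
    patients.append((ctype, shared, specific, int(random.random() < p_death)))

# Stage 1: one overall model for all patients, ignoring cancer type.
overall = fit_logistic([r[1] for r in patients], [r[3] for r in patients])

# Stage 2: one model per cancer type, using the stage-1 risk score
# (on the logit scale) plus the type-specific predictors.
type_models = {}
for ctype in ("leukemia", "lung"):
    rows = [r for r in patients if r[0] == ctype]
    X2 = [[logit(predict(overall, r[1]))] + r[2] for r in rows]
    type_models[ctype] = fit_logistic(X2, [r[3] for r in rows])

def two_stage_risk(ctype, shared, specific):
    # Final risk: type-specific model applied to the overall score.
    score = logit(predict(overall, shared))
    return predict(type_models[ctype], [score] + specific)

print(two_stage_risk("leukemia", [0.0, 0.0], [1.0]))
print(two_stage_risk("lung", [0.0, 0.0], [1.0]))
```

Feeding the stage-1 score into stage 2 as a single predictor keeps each type-specific model small, which matters when some cancer types have few patients or events.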
The two-stage modeling resulted in improved calibration and discrimination across all 11 cancer types. The improvement in AUPRC was significant for hematologic malignancies including leukemia, lymphoma, and myeloma. For instance, the AUPRC increased from 0.358 to 0.519 (∆ = 0.161; 95% CI, 0.102 to 0.224) and from 0.299 to 0.354 (∆ = 0.055; 95% CI, 0.009 to 0.107) for leukemia and lymphoma, respectively. For all 11 cancer types, the two-stage approach generated well-calibrated risks. A high degree of heterogeneity between type-specific and overall risk predictors was observed for most cancer types.
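For readers unfamiliar with the two metrics reported here, the following sketch shows one common way to compute them. It is an illustration under stated assumptions, not the authors' code: AUPRC is approximated by average precision (with no special handling of tied scores), and the calibration slope is taken as the coefficient from a logistic regression of the outcome on the logit of the predicted risk, fitted by Newton's method. The function names and toy data are hypothetical.

```python
import math

def average_precision(y_true, scores):
    # Rank patients by predicted risk (highest first) and average the
    # precision observed at the rank of each true event.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits

def calibration_slope(y_true, risks, iters=25):
    # Fit y ~ a + b * logit(risk) by Newton-Raphson and return b.
    # A slope near 1 suggests good calibration; a slope below 1 suggests
    # predictions that are too extreme. Risks must lie strictly in (0, 1).
    x = [math.log(p / (1.0 - p)) for p in risks]
    a, b = 0.0, 1.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            r, w = yi - mu, mu * (1.0 - mu)
            g0 += r
            g1 += r * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return b

# Toy example: events at ranks 1 and 3, so precision is 1/1 and 2/3.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])

# Ten patients predicted at 0.2 (2 events) and ten at 0.8 (8 events):
# observed rates match the predictions, so the slope is 1.
y = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
slope_good = calibration_slope(y, [0.2] * 10 + [0.8] * 10)

# Same outcomes with more extreme predictions (0.1 and 0.9): slope < 1.
slope_extreme = calibration_slope(y, [0.1] * 10 + [0.9] * 10)
print(ap, slope_good, slope_extreme)
```

AUPRC is a natural choice for discrimination in this setting because the outcome is rare (6.5% mortality), and its baseline value equals the event rate rather than 0.5.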
Our two-stage modeling approach, which accounts for cancer type-specific risk heterogeneity, achieves better calibration and discrimination than a model agnostic to cancer type.