Department of Radiation Medicine and Applied Sciences, University of California San Diego, La Jolla, CA.
School of Medicine, University of California San Diego, La Jolla, CA.
JCO Clin Cancer Inform. 2021 Mar;5:279-287. doi: 10.1200/CCI.20.00137.
Pancreatic cancer is an aggressive malignancy with patients often experiencing nonspecific symptoms before diagnosis. This study evaluates a machine learning approach to help identify patients with early-stage pancreatic cancer from clinical data within electronic health records (EHRs).
From the Optum deidentified EHR data set, we identified patients over 40 years of age diagnosed with early-stage (n = 3,322) or late-stage (n = 25,908) pancreatic cancer between 2009 and 2017. Patients with early-stage pancreatic cancer were matched to noncancer controls (1:16 match). We constructed a prediction model using eXtreme Gradient Boosting (XGBoost) to identify early-stage patients on the basis of 18,220 features within the EHR, including diagnoses, procedures, information within clinical notes, and medications. Model accuracy was assessed with sensitivity, specificity, positive predictive value, and the area under the curve.
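The four accuracy measures named above can be illustrated with a minimal sketch. It is not the study's code: the function names, the toy labels and scores, and the 0.45 threshold are all assumptions made for illustration, and the area under the curve is computed with the standard pairwise-ranking definition rather than any particular library.

```python
def confusion_metrics(labels, predictions):
    """Sensitivity, specificity, and PPV from binary labels and binary predictions.

    Assumes at least one case, one control, and one positive prediction,
    so that no denominator is zero.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),  # cases correctly flagged
        "specificity": tn / (tn + fp),  # controls correctly not flagged
        "ppv": tp / (tp + fp),          # flagged patients who are cases
    }


def auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen case
    scores higher than a randomly chosen control (ties count as one half)."""
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in case_scores
        for k in control_scores
    )
    return wins / (len(case_scores) * len(control_scores))


# Toy data (hypothetical, not from the study): 2 cases and 3 controls.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
predictions = [1 if s >= 0.45 else 0 for s in scores]  # arbitrary threshold

metrics = confusion_metrics(labels, predictions)
toy_auc = auc(labels, scores)
```

The pairwise loop is quadratic in the number of patients, so it suits small illustrations only; a rank-based computation would be the usual choice at the scale of an EHR data set.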
The final predictive model included 582 predictive features from the EHR, including 248 (42.5%) physician note elements, 146 (25.0%) procedure codes, 91 (15.6%) diagnosis codes, 89 (15.3%) medications, and 9 (1.5%) demographic features. The final model area under the curve was 0.84. Choosing a model cut point with a sensitivity of 60% and specificity of 90% would enable early detection of 58% of late-stage patients a median of 24 months before their actual diagnosis.
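One common way to pick such a cut point is to fix the target specificity on the controls and then read off the sensitivity among the cases. The sketch below illustrates that idea under stated assumptions: the helper names and the toy scores are hypothetical, and the abstract does not say how the authors actually selected their cut point.

```python
import math


def cut_point_for_specificity(control_scores, target_specificity):
    """Return the lowest threshold t such that the fraction of controls
    scoring strictly below t is at least target_specificity.

    A patient is flagged when score >= t.
    """
    n = len(control_scores)
    for t in sorted(set(control_scores)):
        specificity = sum(1 for s in control_scores if s < t) / n
        if specificity >= target_specificity:
            return t
    # No observed score is high enough; flag nobody (specificity = 1).
    return math.inf


def sensitivity_at(case_scores, threshold):
    """Fraction of cases flagged at the given threshold."""
    return sum(1 for s in case_scores if s >= threshold) / len(case_scores)


# Toy model scores (hypothetical, not from the study).
control_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
case_scores = [1.0, 1.0, 0.95, 0.5, 1.0]

threshold = cut_point_for_specificity(control_scores, 0.90)
sensitivity = sensitivity_at(case_scores, threshold)
```

Raising the target specificity moves the threshold up and can only lower or preserve the sensitivity, which is the trade-off behind any single operating point on the curve.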
Prediction models using EHR data show promise in the early detection of pancreatic cancer. Although widespread use of this approach on an unselected population would produce high rates of false-positive tests, this technique could have a rapid impact if deployed among high-risk patients or paired with other imaging or biomarker screening tools.
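The false-positive concern follows from Bayes' rule: at a fixed sensitivity and specificity, positive predictive value falls as the disease becomes rarer. The sketch below works through that arithmetic with the 60% sensitivity and 90% specificity quoted above. The prevalence of 1/17 is derived from the 1:16 matching ratio; the other prevalence values are hypothetical round numbers chosen for illustration, not estimates from the study, and none of the resulting PPVs are reported results.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


SENSITIVITY = 0.60  # cut point quoted in the abstract
SPECIFICITY = 0.90  # cut point quoted in the abstract

# Prevalence in a 1:16 matched case-control set: 1 case per 17 patients.
matched_prevalence = 1 / 17
ppv_matched = ppv(SENSITIVITY, SPECIFICITY, matched_prevalence)

# Hypothetical prevalences for less enriched populations (assumptions).
hypothetical_prevalences = [0.0001, 0.001, 0.01]
ppv_by_prevalence = {
    p: ppv(SENSITIVITY, SPECIFICITY, p) for p in hypothetical_prevalences
}

for p, value in ppv_by_prevalence.items():
    print(f"prevalence {p:.4f}: PPV {value:.4f}")
print(f"matched set (1/17): PPV {ppv_matched:.4f}")
```

In the matched set the arithmetic reduces to 0.6 / (0.6 + 1.6), so roughly 27% of flagged patients would be true cases, and the proportion shrinks with every drop in prevalence. This is why restricting use to a high-risk group changes the practical value of the same model.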