HealthCore, Inc., Wilmington, DE, United States; Boston University, Boston, MA, United States.
Pfizer, Inc., Collegeville, PA, United States.
Cancer Epidemiol. 2019 Aug;61:30-37. doi: 10.1016/j.canep.2019.05.006. Epub 2019 May 22.
Although healthcare databases are a valuable source for real-world oncology data, cancer stage is often lacking. We developed predictive models using claims data to identify metastatic/advanced-stage patients with ovarian cancer, urothelial carcinoma, gastric adenocarcinoma, Merkel cell carcinoma (MCC), and non-small cell lung cancer (NSCLC).
Patients with ≥1 diagnosis of a cancer of interest were identified in the HealthCore Integrated Research Database (HIRD), a United States (US) healthcare database (2010-2016). Data were linked to three US state cancer registries and the HealthCore Integrated Research Environment Oncology database to identify cancer stage. Predictive models were constructed to estimate the probability of metastatic/advanced stage. Predictors available in the HIRD were identified and coefficients estimated by Least Absolute Shrinkage and Selection Operator (LASSO) regression with cross-validation to control overfitting. Classification error rates and receiver operating characteristic curves were used to select probability thresholds for classifying patients as cases of metastatic/advanced cancer.
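The threshold-selection step can be illustrated with a minimal sketch. The abstract does not give the exact selection rule, so the rule below (pick the lowest candidate threshold whose positive predictive value reaches a target) is an assumption, as are the function names, the candidate thresholds, and the toy probabilities and labels, which are made up and are not study data.

```python
from typing import List, Optional, Tuple


def confusion_counts(probs: List[float], labels: List[int],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) when patients with prob >= threshold are called cases."""
    tp = fp = fn = tn = 0
    for p, y in zip(probs, labels):
        predicted_case = p >= threshold
        if predicted_case and y == 1:
            tp += 1
        elif predicted_case and y == 0:
            fp += 1
        elif not predicted_case and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def select_threshold(probs: List[float], labels: List[int],
                     candidates: List[float], min_ppv: float) -> Optional[float]:
    """Hypothetical rule: lowest candidate threshold with PPV >= min_ppv.

    Sensitivity cannot increase as the threshold rises, so the lowest
    qualifying threshold keeps the most true cases among those that qualify.
    """
    for t in sorted(candidates):
        tp, fp, _, _ = confusion_counts(probs, labels, t)
        if tp + fp == 0:
            continue  # nobody classified as a case; PPV undefined
        if tp / (tp + fp) >= min_ppv:
            return t
    return None


# Toy data (illustrative only): predicted probabilities and true stage labels
# (1 = metastatic/advanced, 0 = early stage).
toy_probs = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95]
toy_labels = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]

chosen = select_threshold(toy_probs, toy_labels, [0.3, 0.5, 0.7], min_ppv=0.8)
print(chosen)
print(confusion_counts(toy_probs, toy_labels, chosen))
```

In practice the predicted probabilities would come from the fitted LASSO model, and the threshold would be chosen on data not used for fitting.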
We used 2723 ovarian cancer, 6522 urothelial carcinoma, 1441 gastric adenocarcinoma, 109 MCC, and 12,373 NSCLC cases of early and metastatic/advanced cancer to develop predictive models. All models had high discrimination (C > 0.85). At thresholds selected for each model, PPVs were all >0.75: ovarian cancer = 0.95 (95% confidence interval [95% CI]: 0.94-0.96), urothelial carcinoma = 0.78 (95% CI: 0.70-0.86), gastric adenocarcinoma = 0.86 (95% CI: 0.83-0.88), MCC = 0.77 (95% CI: 0.68-0.89), and NSCLC = 0.91 (95% CI: 0.90-0.92).
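For readers unfamiliar with how a PPV and its confidence interval are obtained from classification counts, the following is a minimal sketch. The abstract does not state which interval method was used, so the Wilson score interval here is an assumption, and the counts (90 true positives, 10 false positives) are hypothetical and not taken from the study.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


def ppv_with_ci(tp: int, fp: int) -> Tuple[float, float, float]:
    """PPV = tp / (tp + fp), with an approximate 95% Wilson interval."""
    n = tp + fp
    lower, upper = wilson_interval(tp, n)
    return tp / n, lower, upper


# Hypothetical counts, for illustration only.
ppv, lower, upper = ppv_with_ci(tp=90, fp=10)
print(f"PPV = {ppv:.2f} (95% CI: {lower:.2f}-{upper:.2f})")
```

The interval narrows as the number of patients classified as cases grows, which is consistent with the wider intervals reported for the smaller cohorts such as MCC.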
Predictive modeling was used to identify five types of metastatic/advanced cancer in a healthcare claims database with greater accuracy than previous methods.