Division of Gastroenterology and Hepatology, Stanford University School of Medicine, Stanford, CA.
Division of Gastroenterology, University of Washington, Seattle, WA.
JCO Clin Cancer Inform. 2022 Jun;6:e2200039. doi: 10.1200/CCI.22.00039.
Noncardia gastric cancer (NCGC) is a leading cause of global cancer mortality and is often diagnosed at an advanced stage. Developing NCGC risk models within electronic health records (EHRs) may allow for improved cancer prevention. There has been much recent interest in using machine learning (ML) for cancer prediction, but few studies have compared ML with classical statistical models for NCGC risk prediction.
We trained models using logistic regression (LR) and four commonly used ML algorithms to distinguish NCGC cases from age- and sex-matched controls in two EHR systems: Stanford University and the University of Washington (UW). The LR model contained well-established NCGC risk factors (intestinal metaplasia histology, prior infection, race, ethnicity, nativity status, smoking history, anemia), whereas the ML models selected variables agnostically from the EHR. Models were developed and internally validated in the Stanford data and externally validated in the UW data. Model hyperparameters were tuned using cross-validation. Model performance was compared by accuracy, sensitivity, and specificity.
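As an illustration of hyperparameter tuning by cross-validation, the following is a minimal sketch only, not the study's pipeline. It tunes the neighbor count of a toy one-feature K-nearest neighbor classifier. The function names, the five-fold split, the candidate values of k, and the synthetic data are all assumptions made for this example; the abstract does not report these details.

```python
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training rows nearest to `query` (one feature)."""
    neighbors = sorted(train, key=lambda row: abs(row[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def k_fold_indices(n, n_folds, seed=0):
    """Shuffle the indices 0..n-1 and deal them into n_folds disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def cv_accuracy(data, k, n_folds=5, seed=0):
    """Cross-validated accuracy: each row is predicted by a model that never saw it."""
    correct = 0
    for held_out in k_fold_indices(len(data), n_folds, seed):
        held = set(held_out)
        train = [row for i, row in enumerate(data) if i not in held]
        correct += sum(
            knn_predict(train, data[i][0], k) == data[i][1] for i in held_out
        )
    return correct / len(data)


def tune_k(data, candidates, n_folds=5, seed=0):
    """Return the candidate k with the best cross-validated accuracy, plus all scores."""
    scores = {k: cv_accuracy(data, k, n_folds, seed) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores


# Synthetic two-class data (NOT patient data): label 0 centered at 0, label 1 at 2.
rng = random.Random(42)
toy = [(rng.gauss(0.0, 1.0), 0) for _ in range(40)]
toy += [(rng.gauss(2.0, 1.0), 1) for _ in range(40)]

# Odd candidate values avoid tied votes in a two-class problem.
best_k, scores = tune_k(toy, candidates=(1, 3, 5))
print("cross-validated accuracy by k:", scores)
print("selected k:", best_k)
```

Tuning is confined to the development data, so an external cohort such as the UW data stays untouched until the final evaluation.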
In internal validation, LR performed with comparable accuracy (0.732; 95% CI, 0.698 to 0.764), sensitivity (0.697; 95% CI, 0.647 to 0.744), and specificity (0.767; 95% CI, 0.720 to 0.809) to penalized lasso, support vector machine, K-nearest neighbor, and random forest models. In external validation, LR continued to demonstrate high accuracy, sensitivity, and specificity. Although K-nearest neighbor demonstrated higher accuracy and specificity, this was offset by significantly lower sensitivity. No ML model consistently outperformed LR across evaluation criteria.
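For readers less familiar with these metrics, the sketch below shows how accuracy, sensitivity, and specificity with 95% confidence intervals can be derived from a confusion matrix. The abstract does not state which interval method was used, so the Wilson score interval here is an assumption, and the counts are hypothetical values chosen for illustration rather than the study's data.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def classification_summary(tp, fn, tn, fp):
    """Point estimates and Wilson intervals for three standard metrics.

    tp/fn count cases (true positives / false negatives);
    tn/fp count controls (true negatives / false positives).
    """
    total = tp + fn + tn + fp
    counts = {
        "accuracy": (tp + tn, total),      # correct calls among everyone
        "sensitivity": (tp, tp + fn),      # cases correctly flagged
        "specificity": (tn, tn + fp),      # controls correctly cleared
    }
    return {
        name: (num / den, wilson_interval(num, den))
        for name, (num, den) in counts.items()
    }


# Hypothetical counts for illustration only (not taken from the study).
summary = classification_summary(tp=70, fn=30, tn=80, fp=20)
for name, (estimate, (low, high)) in summary.items():
    print(f"{name}: {estimate:.3f} (95% CI, {low:.3f} to {high:.3f})")
```

Reporting all three metrics matters for comparisons like the one above: a model can gain accuracy and specificity while losing sensitivity, which is the trade-off described for K-nearest neighbor.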
Drawing on data from two independent EHRs, we found that LR based on established risk factors performed comparably to optimized ML algorithms. This study demonstrates that classical models built on robust, hand-chosen predictor variables may not be inferior to data-driven models for NCGC risk prediction.