A Comparison of Logistic Regression Against Machine Learning Algorithms for Gastric Cancer Risk Prediction Within Real-World Clinical Data Streams.

Author affiliations

Division of Gastroenterology and Hepatology, Stanford University School of Medicine, Stanford, CA.

Division of Gastroenterology, University of Washington, Seattle, WA.

Publication information

JCO Clin Cancer Inform. 2022 Jun;6:e2200039. doi: 10.1200/CCI.22.00039.

PMID:35763703

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9259116/

Abstract

PURPOSE

Noncardia gastric cancer (NCGC) is a leading cause of global cancer mortality and is often diagnosed at advanced stages. Development of NCGC risk models within electronic health records (EHR) may allow for improved cancer prevention. There has been much recent interest in the use of machine learning (ML) for cancer prediction, but few studies have compared ML with classical statistical models for NCGC risk prediction.

METHODS

We trained models using logistic regression (LR) and four commonly used ML algorithms to distinguish NCGC cases from age- and sex-matched controls in two EHR systems: Stanford University and the University of Washington (UW). The LR model contained well-established NCGC risk factors (intestinal metaplasia histology, prior Helicobacter pylori infection, race, ethnicity, nativity status, smoking history, anemia), whereas ML models agnostically selected variables from the EHR. Models were developed and internally validated in the Stanford data, and externally validated in the UW data. Model hyperparameters were tuned using cross-validation. Model performance was compared by accuracy, sensitivity, and specificity.
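As an illustration of tuning a hyperparameter by cross-validation, the sketch below selects the number of neighbors for a K-nearest neighbor classifier using 5-fold cross-validation. It is not the authors' pipeline: the abstract does not state the number of folds, the candidate values, or the software used, so the synthetic two-feature data, the fold count, and the function names (`make_toy_data`, `cv_accuracy`, `tune_k`) are all assumptions made for the example.

```python
import random
from collections import Counter


def make_toy_data(n=200, seed=0):
    """Synthetic two-class data (assumption: NOT clinical data)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2                      # alternate cases and controls
        center = 1.5 if label else 0.0     # class means differ by 1.5 per feature
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(X, y, k, n_folds=5):
    """Mean held-out accuracy of k-NN across n_folds interleaved folds."""
    n = len(X)
    fold_scores = []
    for f in range(n_folds):
        test_idx = set(range(f, n, n_folds))
        train_X = [X[i] for i in range(n) if i not in test_idx]
        train_y = [y[i] for i in range(n) if i not in test_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx
        )
        fold_scores.append(correct / len(test_idx))
    return sum(fold_scores) / n_folds


def tune_k(X, y, candidates=(1, 3, 5, 7, 9), n_folds=5):
    """Return the candidate k with the best cross-validated accuracy."""
    scores = {k: cv_accuracy(X, y, k, n_folds) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


X, y = make_toy_data()
best_k, scores = tune_k(X, y)
print("best k:", best_k)
for k, acc in scores.items():
    print(f"k={k}: mean CV accuracy = {acc:.3f}")
```

Odd candidate values are used so that a binary majority vote cannot tie. In a workflow like the one described, the selected value would then be refit on the full development set before being evaluated on the external data.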

RESULTS

In internal validation, LR performed with comparable accuracy (0.732; 95% CI, 0.698 to 0.764), sensitivity (0.697; 95% CI, 0.647 to 0.744), and specificity (0.767; 95% CI, 0.720 to 0.809) to penalized lasso, support vector machine, K-nearest neighbor, and random forest models. In external validation, LR continued to demonstrate high accuracy, sensitivity, and specificity. Although K-nearest neighbor demonstrated higher accuracy and specificity, this was offset by significantly lower sensitivity. No ML model consistently outperformed LR across evaluation criteria.
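To make the three reported metrics concrete, the sketch below derives accuracy, sensitivity, and specificity from a confusion matrix and attaches a 95% interval to each. The abstract does not say how its confidence intervals were calculated, so the Wilson score interval used here is an assumption, and the counts in the usage lines are hypothetical rather than taken from the study.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half


def summarize(tp, fp, tn, fn):
    """Point estimate and interval for accuracy, sensitivity, and specificity."""
    n = tp + fp + tn + fn
    return {
        # accuracy: share of all subjects classified correctly
        "accuracy": ((tp + tn) / n, wilson_interval(tp + tn, n)),
        # sensitivity: share of true cases flagged as cases
        "sensitivity": (tp / (tp + fn), wilson_interval(tp, tp + fn)),
        # specificity: share of true controls flagged as controls
        "specificity": (tn / (tn + fp), wilson_interval(tn, tn + fp)),
    }


# Hypothetical counts for illustration only (not from the paper).
results = summarize(tp=70, fp=25, tn=75, fn=30)
for name, (estimate, (low, high)) in results.items():
    print(f"{name}: {estimate:.3f} (95% CI, {low:.3f} to {high:.3f})")
```

Because sensitivity and specificity are each computed on only the cases or only the controls, their intervals are wider than the interval for accuracy on the same data, which matches the pattern in the reported results.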

CONCLUSION

Drawing data from two independent EHRs, we find that LR based on established risk factors performed comparably to optimized ML algorithms. This study demonstrates that classical models built on robust, hand-chosen predictor variables may not be inferior to data-driven models for NCGC risk prediction.
