Xiao Xingjian, Yi Xiaohan, Soe Nyi Nyi, Latt Phyu Mon, Lin Luotao, Chen Xuefen, Song Hualing, Sun Bo, Zhao Hailei, Xu Xianglong
School of Public Health, Shanghai University of Traditional Chinese Medicine, Shanghai, China.
School of Translational Medicine, Faculty of Medicine, Nursing and Health Sciences, Monash University, Clayton, VIC, Australia; Artificial Intelligence and Modelling in Epidemiology Program, Melbourne Sexual Health Centre, Alfred Health, Carlton, VIC, Australia.
Ann Epidemiol. 2025 Jan;101:27-35. doi: 10.1016/j.annepidem.2024.12.003. Epub 2024 Dec 13.
Globally, China is among the countries with relatively high cancer incidence and mortality rates.
Our objective was to create an online cancer risk prediction tool for middle-aged and elderly Chinese adults by leveraging machine learning algorithms and self-reported data.
Drawing on a cohort of 19,798 participants aged 45 and above from the China Health and Retirement Longitudinal Study (2011-2018), we employed nine machine learning algorithms, all commonly used for classification and regression tasks, to construct predictive models for various cancers: Logistic Regression (LR), Adaptive Boosting (Adaboost), Support Vector Machine (SVM), Random Forest (RF), Gaussian Naive Bayes (GNB), Gradient Boosting Machine (GBM), Light Gradient Boosting Machine (LGBM), eXtreme Gradient Boosting (XGBoost), and K-Nearest Neighbors (KNN). Using non-invasive self-reported predictors encompassing demographic, educational, marital, lifestyle, health history, and other factors, we focused on predicting "Cancer or Malignant Tumour" outcomes. The cancers that can be predicted mainly include lung cancer, breast cancer, cervical cancer, colorectal cancer, gastric cancer, esophageal cancer, and other rare cancers.
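As a rough illustration of this kind of multi-algorithm comparison, the sketch below fits seven of the nine named model families with scikit-learn and scores each by cross-validated AUC. It is not the authors' pipeline: the data are synthetic (generated with `make_classification`), the feature count, class balance, hyperparameters, and 5-fold scheme are all assumptions, and LGBM and XGBoost are omitted because they live in separate packages.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_models(seed=0):
    """Return the candidate classifiers, keyed by the abbreviations used above.

    Scale-sensitive models (LR, SVM, KNN) are wrapped in a scaling pipeline.
    All hyperparameters are library defaults, chosen only for illustration.
    """
    return {
        "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "Adaboost": AdaBoostClassifier(random_state=seed),
        "SVM": make_pipeline(StandardScaler(), SVC(random_state=seed)),
        "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
        "GNB": GaussianNB(),
        "GBM": GradientBoostingClassifier(random_state=seed),
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    }


def compare_models(n_samples=600, seed=0):
    """Score every model by mean 5-fold cross-validated AUC on synthetic data."""
    # Synthetic stand-in for self-reported predictors; NOT the CHARLS data.
    X, y = make_classification(
        n_samples=n_samples,
        n_features=12,
        n_informative=5,
        weights=[0.9, 0.1],  # assumed imbalance, for illustration only
        random_state=seed,
    )
    results = {}
    for name, model in build_models(seed).items():
        scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        results[name] = scores.mean()
    return results


if __name__ == "__main__":
    for name, auc in sorted(compare_models().items(), key=lambda kv: -kv[1]):
        print(f"{name:8s} mean AUC = {auc:.3f}")
```

Ranking models by AUC rather than accuracy is a deliberate choice in this sketch, since AUC is less sensitive to how rare the outcome is.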
In the developed tool, MyCancerRisk, the Random Forest algorithm achieved an area under the curve (AUC) of 0.75 and an accuracy (ACC) of 0.99 using self-reported variables. Key predictors identified include age, self-rated health, sleep patterns, household heating sources, childhood health status, living conditions, and smoking habits.
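The two reported metrics measure different things, and a small worked example may help when reading them side by side. The sketch below uses a hypothetical population with an assumed 1% outcome prevalence and made-up Gaussian risk scores; none of these numbers come from the study. It shows only the general point that, for a rare outcome, accuracy can be very high even for a classifier that never flags anyone, while AUC reflects how well the scores rank cases above non-cases.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def make_population(n_pos=100, n_neg=9900, seed=0):
    """Hypothetical cohort: 1% positives, with partially informative risk scores."""
    rng = random.Random(seed)
    y = [1] * n_pos + [0] * n_neg
    # Assumed score model: positives are shifted up by one standard deviation.
    scores = [rng.gauss(1.0, 1.0) for _ in range(n_pos)]
    scores += [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
    return y, scores


if __name__ == "__main__":
    y, scores = make_population()
    always_negative = [0] * len(y)
    # A classifier that never predicts cancer is still right 99% of the time.
    print("ACC, always-negative:", accuracy(y, always_negative))
    # Constant scores carry no ranking information, so AUC is 0.5.
    print("AUC, constant score: ", roc_auc(y, [0.0] * len(y)))
    # Informative scores raise AUC regardless of any decision threshold.
    print("AUC, informative:    ", round(roc_auc(y, scores), 3))
```

For this reason, AUC is usually the more informative of the two figures when the outcome is rare.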
MyCancerRisk aims to serve as a preventative screening tool, encouraging individuals to undergo testing and adopt healthier behaviours to mitigate the public health impact of cancer. Our study also sheds light on unconventional predictors, such as housing conditions, offering valuable insights for refining cancer prediction models.