Predicting cancer risk using machine learning on lifestyle and genetic data.

Author information

Ahmed Mohamed Abdelmoaty, AbdelMoety Ahmed, Soliman Asmaa Mohamed Ahmed

Affiliations

Faculty of Medicine, Merit University, Sohag, Egypt.

Electrical Engineering Department, Faculty of Engineering, South Valley University, Qena, 83523, Egypt.

Publication information

Sci Rep. 2025 Aug 19;15(1):30458. doi: 10.1038/s41598-025-15656-8.

DOI:10.1038/s41598-025-15656-8

PMID:40830557

Abstract

Cancer remains one of the leading causes of mortality worldwide, and early detection significantly improves patient outcomes and reduces treatment burden. This study investigates the application of Machine Learning (ML) techniques to predict cancer risk from a combination of genetic and lifestyle factors. A structured dataset of 1,200 patient records was used, with features comprising age, gender, Body Mass Index (BMI), smoking status, alcohol intake, physical activity, genetic risk level, and personal history of cancer. An end-to-end ML pipeline was implemented, covering data exploration, preprocessing, feature scaling, model training, and evaluation using stratified cross-validation and a separate test set. Nine supervised learning algorithms were evaluated and compared, including Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Support Vector Machines (SVMs), and several ensemble methods. Among these, Categorical Boosting (CatBoost) achieved the highest predictive performance, with a test accuracy of 98.75% and an F1-score of 0.9820, outperforming both traditional and other advanced models. Feature importance analysis confirmed the strong influence of cancer history, genetic risk, and smoking status on prediction outcomes. The findings highlight the effectiveness of boosting-based ensemble models in capturing complex interactions within health data and support their potential use in personalized cancer risk assessment. This research underscores the value of integrating genetic and modifiable lifestyle variables into predictive modeling to enhance early detection and preventive healthcare strategies.
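The evaluation protocol named in the abstract (stratified cross-validation, scored by accuracy and F1) can be illustrated in plain Python. This is a minimal sketch and not the authors' code: the function names, the five folds, and the synthetic labels are all assumptions made for illustration, and no model is trained.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a fair share of it.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic labels for illustration only (assumption): 30 positive, 70 negative.
labels = [1] * 30 + [0] * 70
folds = stratified_folds(labels, k=5)

for fold_id, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    positives = sum(labels[i] for i in test_idx)
    # A real run would fit a model on train_idx and score it on test_idx here.
    print(fold_id, len(train_idx), len(test_idx), positives)
```

Stratifying keeps the share of positive cases the same in every fold, so per-fold accuracy and F1 are comparable even when the classes are imbalanced. In practice a library implementation of stratified splitting and these metrics would normally be used instead of hand-written helpers.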
