Qi Xiangjun, Wang Shujing, Fang Caishan, Jia Jie, Lin Lizhu, Yuan Tianhui
The First Clinical Medical College, Guangzhou University of Chinese Medicine, Guangzhou, 510000, China.
Hospital of Chengdu University of Traditional Chinese Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu, 610031, China; Yong Loo Lin School of Medicine, National University of Singapore, 117597, Singapore.
Redox Biol. 2025 Feb;79:103470. doi: 10.1016/j.redox.2024.103470. Epub 2024 Dec 16.
To develop and validate a machine learning model incorporating dietary antioxidants to predict cardiovascular disease (CVD)-cancer comorbidity and to elucidate the role of antioxidants in disease prediction.
Data were sourced from the National Health and Nutrition Examination Survey (NHANES). Dietary antioxidants, including vitamins, minerals, and polyphenols, were selected as key features. Demographic, lifestyle, and health condition features were also incorporated to improve model accuracy. Feature preprocessing comprised removing collinear features, addressing class imbalance, and normalizing the data. Five models were constructed within the mlr3 framework: recursive partitioning and regression trees, random forest, kernel k-nearest neighbors, naive Bayes, and light gradient boosting machine (LightGBM). Benchmarking was used to evaluate and compare model performance systematically. SHapley Additive exPlanations (SHAP) values were calculated to quantify each feature's contribution to the predictions of the best-performing model.
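The three preprocessing steps named above can be illustrated with a minimal standard-library Python sketch. It is not the authors' mlr3 pipeline: the abstract does not state the collinearity criterion, the resampling method, or the scaling method, so the pairwise Pearson threshold of 0.9, random oversampling, and min-max scaling used here are all assumptions, and the feature names and values are invented toy data.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def drop_collinear(columns, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is <= threshold.
    The threshold is an assumption, not a value from the abstract."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, v)) <= threshold for v in kept.values()):
            kept[name] = values
    return kept

def min_max(values):
    """Rescale values to [0, 1]; constant columns map to 0.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until both classes are equal in size.
    Random oversampling is an assumed stand-in for the unspecified method."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(labels))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]

# Toy data; feature names and values are hypothetical.
columns = {
    "vitamin_c": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "vitamin_c_copy": [11.0, 21.0, 29.0, 41.0, 52.0, 59.0],  # nearly collinear
    "magnesium": [300.0, 120.0, 250.0, 90.0, 310.0, 200.0],
}
labels = [0, 0, 0, 0, 1, 0]  # 1 = comorbid case (rare class)

kept = drop_collinear(columns)
scaled = {name: min_max(vals) for name, vals in kept.items()}
rows = [list(r) for r in zip(*scaled.values())]
balanced_rows, balanced_labels = oversample_minority(rows, labels)
```

In this toy run the near-duplicate column is dropped and the single positive row is duplicated until the classes match. In practice, resampling is usually applied to the training folds only, so that duplicated rows do not leak into the evaluation data.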
The analysis included 10,064 participants, of whom 353 had comorbid CVD and cancer. After collinear features were excluded, the model retained 29 dietary antioxidant features and 9 baseline features. LightGBM achieved the highest predictive accuracy (87.9%, corresponding to a classification error rate of 12.1%) and the highest areas under the receiver operating characteristic curve (0.951) and the precision-recall curve (0.930). LightGBM also showed balanced sensitivity and specificity, both close to 88%. SHAP analysis indicated that naringenin, magnesium, theaflavin, kaempferol, hesperetin, selenium, malvidin, and vitamin C were the most influential contributors.
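The reported metrics are related by simple arithmetic, which a short Python sketch can make explicit. Only the participant counts are taken from the abstract; the confusion-matrix counts below are hypothetical, chosen so that sensitivity and specificity both equal 88%, and do not reproduce the study's actual predictions.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics derived from a binary confusion matrix."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,            # classification error = 1 - accuracy
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (tn + fp),    # true negative rate
    }

# Figures stated in the abstract.
participants, comorbid = 10_064, 353
prevalence = comorbid / participants      # share of participants with comorbidity

# Hypothetical confusion matrix (NOT taken from the paper).
metrics = classification_metrics(tp=88, fp=12, tn=88, fn=12)

print(f"prevalence: {prevalence:.1%}")
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Accuracy and classification error always sum to 1, which is why the abstract's 87.9% and 12.1% are two views of the same quantity. The prevalence line shows that the comorbid group makes up roughly 3.5% of the sample, which is the class imbalance the preprocessing step addresses.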
LightGBM exhibited the best performance for predicting CVD-cancer comorbidity. SHAP values highlighted the importance of antioxidants, with naringenin and magnesium emerging as primary factors in this model.
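To show what a SHAP value measures, the sketch below computes exact Shapley values for a three-feature toy model by brute force over all feature orderings. This is an illustration of the underlying definition, not the procedure used in the study: the scoring function `toy_model`, its coefficients, and the all-zero baseline are hypothetical and carry no information about the direction or size of any real antioxidant effect.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance, averaging each feature's
    marginal contribution over all orderings (feasible only for few features)."""
    features = list(instance)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)          # start from the reference input
        prev = predict(current)
        for f in order:
            current[f] = instance[f]      # switch feature f to its actual value
            new = predict(current)
            totals[f] += new - prev       # marginal contribution in this ordering
            prev = new
    return {f: t / len(orderings) for f, t in totals.items()}

def toy_model(x):
    """Hypothetical scoring function with one interaction term."""
    return (0.5 * x["naringenin"]
            + 0.3 * x["magnesium"]
            + 0.2 * x["naringenin"] * x["selenium"])

baseline = {"naringenin": 0.0, "magnesium": 0.0, "selenium": 0.0}
instance = {"naringenin": 1.0, "magnesium": 1.0, "selenium": 1.0}

phi = shapley_values(toy_model, instance, baseline)
for feature, value in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{feature}: {value:.3f}")
```

The attributions sum to the difference between the model output for the instance and for the baseline, and the interaction term is split evenly between the two features involved. Ranking features by the magnitude of such attributions is the kind of ordering that the SHAP analysis in the abstract reports.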