Australian Centre for Precision Health, UniSA Clinical and Health Sciences, University of South Australia, Adelaide, Australia.
Computational Learning Systems Laboratory, UniSA STEM, University of South Australia, Mawson Lakes, Australia.
Sci Rep. 2021 Nov 26;11(1):22997. doi: 10.1038/s41598-021-02476-9.
We present a simple and efficient hypothesis-free machine learning pipeline for risk factor discovery that accounts for non-linearity and interaction in large biomedical databases with minimal variable pre-processing. In this study, mortality models were built using gradient boosting decision trees (GBDT), and important predictors were identified using a Shapley value-based feature attribution method (SHAP values). Cox models controlled for false discovery rate were used for confounder adjustment, interpretability, and further validation. The pipeline was tested using information from 502,506 UK Biobank participants, aged 37-73 years at recruitment and followed over seven years for mortality registrations. Of the 11,639 predictors included in GBDT, 193 potential risk factors had SHAP values ≥ 0.05, passed the correlation test, and were selected for further modelling. Of the total summed variable importance, 60% was directly health-related, while baseline characteristics, sociodemographics, and lifestyle factors each contributed about 10%. Cox models adjusted for baseline characteristics showed evidence for an association with mortality for 166 of the 193 predictors. These were mostly well-known risk factors (e.g., age, sex, ethnicity, education, material deprivation, smoking, physical activity, self-rated health, BMI, and many disease outcomes). For 19 predictors we saw evidence for an association in the unadjusted but not the adjusted analyses, suggesting bias by confounding. Our GBDT-SHAP pipeline was able to identify relevant predictors 'hidden' within thousands of variables, providing an efficient and pragmatic solution for the first stage of hypothesis-free risk factor identification.
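The two filtering steps described above (screening predictors by SHAP value, then controlling the false discovery rate across the follow-up Cox models) can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: that the SHAP criterion is the mean absolute SHAP value per predictor, and that the false discovery rate is controlled with the Benjamini-Hochberg procedure. The function names, the toy SHAP matrix, and the p-values are all hypothetical; in a real pipeline the SHAP values would come from an explainer applied to the fitted GBDT model and the p-values from the fitted Cox models. The correlation test is not shown because the abstract does not specify it.

```python
from typing import Dict, List, Sequence


def mean_abs_shap(shap_rows: Sequence[Sequence[float]]) -> List[float]:
    """Average |SHAP| per predictor over all participants (one row each)."""
    n_rows = len(shap_rows)
    n_features = len(shap_rows[0])
    return [
        sum(abs(row[j]) for row in shap_rows) / n_rows
        for j in range(n_features)
    ]


def screen_features(
    names: Sequence[str],
    shap_rows: Sequence[Sequence[float]],
    threshold: float = 0.05,  # cut-off quoted in the abstract
) -> List[str]:
    """Keep predictors whose mean |SHAP| reaches the threshold.

    Assumption: the abstract does not say how SHAP values are aggregated;
    mean absolute value is a common convention, used here for illustration.
    """
    importance = mean_abs_shap(shap_rows)
    return [name for name, imp in zip(names, importance) if imp >= threshold]


def benjamini_hochberg(p_values: Dict[str, float], alpha: float = 0.05) -> List[str]:
    """Return the predictors rejected at FDR level alpha (step-up procedure).

    Assumption: Benjamini-Hochberg is one standard way to control the
    false discovery rate; the abstract does not name the method used.
    """
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        # Track the largest rank whose p-value falls under its threshold.
        if p <= alpha * rank / m:
            cutoff_rank = rank
    return [name for name, _ in ranked[:cutoff_rank]]


# Toy SHAP matrix: 3 participants x 3 predictors (made-up numbers).
names = ["age", "smoking", "noise"]
shap_rows = [
    [0.40, -0.10, 0.01],
    [-0.30, 0.20, -0.02],
    [0.50, -0.06, 0.00],
]
selected = screen_features(names, shap_rows)
print(selected)  # ['age', 'smoking']

# Made-up p-values standing in for per-predictor Cox model results.
cox_p_values = {"age": 0.001, "smoking": 0.030, "coffee": 0.20}
significant = benjamini_hochberg(cox_p_values)
print(significant)  # ['age', 'smoking']
```

Because Benjamini-Hochberg is a step-up procedure, a predictor whose p-value misses its own rank threshold can still be retained if a larger p-value further down the ranking meets its threshold, which is why the loop tracks the largest passing rank rather than stopping at the first failure.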