

Combining machine learning and conventional statistical approaches for risk factor discovery in a large cohort study.

Affiliations

Australian Centre for Precision Health, UniSA Clinical and Health Sciences, University of South Australia, Adelaide, Australia.

Computational Learning Systems Laboratory, UniSA STEM, University of South Australia, Mawson Lakes, Australia.

Publication information

Sci Rep. 2021 Nov 26;11(1):22997. doi: 10.1038/s41598-021-02476-9.

DOI:10.1038/s41598-021-02476-9
PMID:34837000
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8626442/
Abstract

We present a simple and efficient hypothesis-free machine learning pipeline for risk factor discovery that accounts for non-linearity and interaction in large biomedical databases with minimal variable pre-processing. In this study, mortality models were built using gradient boosting decision trees (GBDT) and important predictors were identified using a Shapley values-based feature attribution method, SHAP values. Cox models controlled for false discovery rate were used for confounder adjustment, interpretability, and further validation. The pipeline was tested using information from 502,506 UK Biobank participants, aged 37-73 years at recruitment and followed over seven years for mortality registrations. From the 11,639 predictors included in GBDT, 193 potential risk factors had SHAP values ≥ 0.05, passed the correlation test, and were selected for further modelling. Of the total variable importance summed up, 60% was directly health related, and baseline characteristics, sociodemographics, and lifestyle factors each contributed about 10%. Cox models adjusted for baseline characteristics showed evidence for an association with mortality for 166 out of the 193 predictors. These included mostly well-known risk factors (e.g., age, sex, ethnicity, education, material deprivation, smoking, physical activity, self-rated health, BMI, and many disease outcomes). For 19 predictors we saw evidence for an association in the unadjusted but not adjusted analyses, suggesting bias by confounding. Our GBDT-SHAP pipeline was able to identify relevant predictors 'hidden' within thousands of variables, providing an efficient and pragmatic solution for the first stage of hypothesis-free risk factor identification.
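The two stages described in the abstract (attribution-based screening, then association tests controlled for false discovery rate) can be illustrated with a small, dependency-free sketch. This is not the authors' code. The toy model, the single all-zero baseline row, the p-values, and the function names `exact_shapley` and `benjamini_hochberg` are all hypothetical. Shapley values are computed here by brute-force enumeration rather than by the tree-based SHAP algorithm used with GBDT, and the 0.05 threshold is applied to one sample's attributions, whereas a real analysis would presumably summarise attributions across the cohort. The abstract does not name the FDR procedure, so Benjamini-Hochberg is assumed as one common choice.

```python
from itertools import combinations
from math import factorial


def exact_shapley(predict, x, baseline):
    """Exact Shapley values for one sample x, relative to a baseline row.

    Features outside a coalition are set to their baseline value.
    Cost grows as 2^n, so this is only feasible for a handful of features.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without_i = [x[j] if j in coalition else baseline[j]
                             for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    cutoff_rank = 0
    # Step-up rule: find the largest rank whose p-value is under its threshold
    for rank, k in enumerate(order, start=1):
        if p_values[k] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for k in order[:cutoff_rank]:
        rejected[k] = True
    return rejected


def toy_model(row):
    # Hypothetical risk score: features 0 and 1 interact; feature 2 is unused
    return 2.0 * row[0] + row[0] * row[1]


# Stage 1: attribute the prediction to features, keep those with |value| >= 0.05
phi = exact_shapley(toy_model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
selected = [i for i, value in enumerate(phi) if abs(value) >= 0.05]

# Stage 2: made-up p-values standing in for per-predictor Cox model tests
p_values = [0.001, 0.02, 0.04, 0.30]
significant = benjamini_hochberg(p_values, alpha=0.05)

print("Shapley values:", phi)
print("Selected feature indices:", selected)
print("Rejected at FDR 0.05:", significant)
```

In the toy model the interaction term is split between features 0 and 1, and the unused feature receives zero attribution, which is the property that lets this kind of screening drop irrelevant variables without specifying interactions in advance.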


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a997/8626442/f8dfe89d73b7/41598_2021_2476_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a997/8626442/901ec0fb0ca2/41598_2021_2476_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a997/8626442/7e02d4cb4f38/41598_2021_2476_Fig3_HTML.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a997/8626442/931b389b3da2/41598_2021_2476_Fig4_HTML.jpg
