Variable selection in social-environmental data: sparse regression and tree ensemble machine learning approaches.

Author affiliations

Biostatistics and Bioinformatics Facility, Fox Chase Cancer Center, Reimann 383, 333 Cottman Ave, Philadelphia, PA, 19111, USA.

Cancer Prevention and Control, Fox Chase Cancer Center, Young Pavilion, 333 Cottman Ave, Philadelphia, PA, 19111, USA.

Publication information

BMC Med Res Methodol. 2020 Dec 10;20(1):302. doi: 10.1186/s12874-020-01183-9.

DOI:10.1186/s12874-020-01183-9
PMID:33302880
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7727197/
Abstract

BACKGROUND

Social-environmental data obtained from the US Census is an important resource for understanding health disparities, but rarely is the full dataset utilized for analysis. A barrier to incorporating the full data is a lack of solid recommendations for variable selection, with researchers often hand-selecting a few variables. Thus, we evaluated the ability of empirical machine learning approaches to identify social-environmental factors having a true association with a health outcome.

METHODS

We compared several popular machine learning methods, including penalized regressions (e.g. lasso, elastic net), and tree ensemble methods. Via simulation, we assessed the methods' ability to identify census variables truly associated with binary and continuous outcomes while minimizing false positive results (10 true associations, 1000 total variables). We applied the most promising method to the full census data (p = 14,663 variables) linked to prostate cancer registry data (n = 76,186 cases) to identify social-environmental factors associated with advanced prostate cancer.
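
The simulation design described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: the function names (`simulate`, `lasso_cd`, `score_selection`) are hypothetical, the dimensions are much smaller than the 10-in-1000 design, the predictors are independent Gaussians (real census variables are strongly correlated), only a continuous outcome is simulated, and the F1 score is an assumed stand-in for a combined accuracy measure.

```python
import random


def simulate(n=200, p=50, n_true=5, effect=1.0, seed=0):
    """Simulate n rows of p independent predictors; the first n_true drive y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    truth = list(range(n_true))
    y = [effect * sum(row[j] for j in truth) + rng.gauss(0, 1) for row in X]
    return X, y, truth


def soft_threshold(z, t):
    """Shrink z toward zero by t; this operator produces the lasso's exact zeros."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Lasso by cyclic coordinate descent, no intercept.

    Minimizes (1 / 2n) * ||y - X b||^2 + lam * ||b||_1.
    """
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    col_sq = [sum(v * v for v in c) / n for c in cols]
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, kept up to date as b changes
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes column j.
            rho = sum(cols[j][i] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            if new != beta[j]:
                delta = new - beta[j]
                for i in range(n):
                    resid[i] -= delta * cols[j][i]
                beta[j] = new
    return beta


def score_selection(selected, truth):
    """Count true/false positives and compute F1 (assumed combined measure)."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)
    fp = len(selected - truth)
    fn = len(truth - selected)
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "f1": f1}


if __name__ == "__main__":
    X, y, truth = simulate()
    beta = lasso_cd(X, y, lam=0.3)
    selected = [j for j, b in enumerate(beta) if b != 0.0]
    print(score_selection(selected, truth))
```

Raising `lam` trades true positives for fewer false positives, which is the tension the comparison of methods is meant to quantify.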

RESULTS

In simulations, we found that elastic net identified many true-positive variables, while lasso provided good control of false positives. Using a combined measure of accuracy, hierarchical clustering based on Spearman's correlation with sparse group lasso regression performed the best overall. Bayesian Additive Regression Trees (BART) outperformed other tree ensemble methods, but not the sparse group lasso. In the full dataset, the sparse group lasso successfully identified a subset of variables, three of which replicated earlier findings.
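
The grouping step behind the best-performing approach can be illustrated as follows. This is a minimal sketch under stated assumptions, not the authors' procedure: the abstract does not give the linkage rule or the cut height, so single linkage, the distance 1 - |rho|, and the `cut=0.5` default are all assumptions, and the function names are hypothetical. The resulting groups would then be passed to a sparse group lasso, which is not implemented here.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5


def spearman(a, b):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def cluster_columns(columns, cut=0.5):
    """Group variables by single linkage on the distance 1 - |rho|.

    Cutting a single-linkage dendrogram at height `cut` yields the connected
    components of the graph joining pairs whose distance is <= cut, so a
    union-find over those pairs is enough.
    """
    p = len(columns)
    parent = list(range(p))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if 1 - abs(spearman(columns[i], columns[j])) <= cut:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(p):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Using the absolute value of rho places strongly negatively correlated variables in the same group, which suits a penalty that selects or drops whole groups; a different linkage or cut would give different groups.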

CONCLUSIONS

This analysis demonstrated the potential of empirical machine learning approaches to identify a small subset of census variables that have a true association with the outcome and that replicate across empirical methods. Sparse clustered regression models performed best, as they identified many true-positive variables while controlling false-positive discoveries.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c313/7727197/16b257e2134d/12874_2020_1183_Fig1_HTML.jpg
