Biostatistics and Bioinformatics Facility, Fox Chase Cancer Center, Reimann 383, 333 Cottman Ave, Philadelphia, PA, 19111, USA.
Cancer Prevention and Control, Fox Chase Cancer Center, Young Pavilion, 333 Cottman Ave, Philadelphia, PA, 19111, USA.
BMC Med Res Methodol. 2020 Dec 10;20(1):302. doi: 10.1186/s12874-020-01183-9.
Social-environmental data obtained from the US Census is an important resource for understanding health disparities, but rarely is the full dataset utilized for analysis. A barrier to incorporating the full data is a lack of solid recommendations for variable selection, with researchers often hand-selecting a few variables. Thus, we evaluated the ability of empirical machine learning approaches to identify social-environmental factors having a true association with a health outcome.
We compared several popular machine learning methods, including penalized regressions (e.g. lasso, elastic net) and tree ensemble methods. Via simulation (10 true associations among 1000 total variables), we assessed the methods' ability to identify census variables truly associated with binary and continuous outcomes while minimizing false positive results. We applied the most promising method to the full census data (p = 14,663 variables) linked to prostate cancer registry data (n = 76,186 cases) to identify social-environmental factors associated with advanced prostate cancer.
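The simulation design described above (10 true associations among 1000 candidate variables, with selection judged by true and false positives) can be illustrated with a minimal sketch. It is not the authors' code. The sample size, block correlation structure, effect size, and penalty values are all assumptions chosen for illustration, and only the continuous-outcome case is shown.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n, p, k = 500, 1000, 10  # n is an assumption; p and k follow the abstract

# Assumed correlation structure: blocks of 10 variables share a latent factor,
# loosely mimicking correlated census variables.
block_size = 10
latent = rng.normal(size=(n, p // block_size))
X = 0.7 * np.repeat(latent, block_size, axis=1) + 0.7 * rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Ten truly associated variables with an assumed common effect size.
true_idx = rng.choice(p, size=k, replace=False)
beta = np.zeros(p)
beta[true_idx] = 0.5
y = X @ beta + rng.normal(size=n)


def selection_counts(model):
    """Fit the model and count true and false positives among nonzero coefficients."""
    model.fit(X, y)
    selected = set(np.flatnonzero(model.coef_))
    truth = set(true_idx)
    return len(selected & truth), len(selected - truth)


# Penalty strengths are illustrative; in practice they would be tuned,
# for example by cross-validation.
results = {
    "lasso": selection_counts(Lasso(alpha=0.1, max_iter=5000)),
    "elastic_net": selection_counts(ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=5000)),
}

for name, (tp, fp) in results.items():
    print(f"{name}: true positives = {tp}, false positives = {fp}")
```

Repeating this over many simulated datasets and averaging the counts would give the kind of method comparison the abstract reports.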
In simulations, we found that elastic net identified many true-positive variables, while lasso provided good control of false positives. Using a combined measure of accuracy, hierarchical clustering based on Spearman's correlation, combined with sparse group lasso regression, performed the best overall. Bayesian Additive Regression Trees (BART) outperformed other tree ensemble methods, but not the sparse group lasso. In the full dataset, the sparse group lasso successfully identified a subset of variables, three of which replicated earlier findings.
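The grouping step behind the best-performing approach, hierarchical clustering of variables on Spearman's correlation, might look like the sketch below. The distance definition (1 minus the absolute correlation), the average linkage, and the 0.5 cut height are assumptions, as the abstract does not specify them. The toy data are simulated. The resulting group labels would then be passed to a sparse group lasso implementation, which is not shown here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n, n_blocks, block_size = 300, 3, 4

# Toy data: 3 blocks of 4 strongly correlated variables (columns ordered by block).
latent = rng.normal(size=(n, n_blocks))
X = np.repeat(latent, block_size, axis=1) + 0.1 * rng.normal(size=(n, n_blocks * block_size))

# Spearman correlation matrix between columns (variables).
rho, _ = spearmanr(X)

# Assumed dissimilarity: 1 - |rho|, symmetrized with an exact zero diagonal.
dist = 1.0 - np.abs(rho)
dist = (dist + dist.T) / 2.0
np.fill_diagonal(dist, 0.0)

# Average-linkage hierarchical clustering on the condensed distance matrix.
tree = linkage(squareform(dist, checks=False), method="average")

# Cut the tree at an assumed height to obtain one group label per variable.
groups = fcluster(tree, t=0.5, criterion="distance")
print(groups)
```

Grouping correlated variables before penalized regression lets the model select or drop whole clusters while still allowing sparsity within a cluster.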
This analysis demonstrated the potential of empirical machine learning approaches to identify a small subset of census variables that have a true association with the outcome and that replicate across empirical methods. Sparse clustered regression models performed best, as they identified many true positive variables while controlling false positive discoveries.