Detection of independent associations in a large epidemiologic dataset: a comparison of random forests, boosted regression trees, conventional and penalized logistic regression for identifying independent factors associated with H1N1pdm influenza infections.

Author information

Mansiaux Yohann, Carrat Fabrice

Affiliations

INSERM, UMR_S 1136, Institut Pierre Louis d'Epidémiologie et de Santé Publique, F-75013 Paris, France.

Publication information

BMC Med Res Methodol. 2014 Aug 26;14:99. doi: 10.1186/1471-2288-14-99.

BACKGROUND

The use of big data is growing steadily in epidemiology. We explored the performance of methods dedicated to big data analysis for detecting independent associations between exposures and a health outcome.

METHODS

We searched for associations between 303 covariates and influenza infection in 498 subjects (14% infected) sampled from a dedicated cohort. Independent associations were detected using four approaches: two data mining methods, Random Forests (RF) and Boosted Regression Trees (BRT); the conventional logistic regression framework (Univariate Followed by Multivariate Logistic Regression, UFMLR); and multivariate logistic regression with a Least Absolute Shrinkage and Selection Operator (LASSO) penalty, which yields a sparse selection of covariates. We developed permutation tests to assess the statistical significance of associations. We simulated 500 similarly sized datasets to estimate the true positive rate (TPR) and false positive rate (FPR) of each method.
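The abstract does not describe how the permutation tests were built, so the following is only a generic sketch of the idea, not the authors' procedure. It assumes a single covariate, a binary outcome, and a deliberately simple test statistic (the absolute difference in covariate means between infected and uninfected subjects). The function name, the toy data, and the number of permutations are all hypothetical.

```python
import random


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Permutation p-value for the association between covariate x and binary outcome y.

    Assumed test statistic (for illustration only): absolute difference in the
    mean of x between subjects with y == 1 and subjects with y == 0.
    """
    rng = random.Random(seed)

    def stat(labels):
        pos = [xi for xi, yi in zip(x, labels) if yi == 1]
        neg = [xi for xi, yi in zip(x, labels) if yi == 0]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

    observed = stat(y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the outcome breaks any real link between x and y
        # while keeping the proportion of infected subjects fixed.
        rng.shuffle(shuffled)
        if stat(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy data (made up): 20 infected subjects with low titers, 80 uninfected with high titers.
outcome = [1] * 20 + [0] * 80
titer = [0.0] * 20 + [1.0] * 80
p = permutation_p_value(titer, outcome)
```

In a setting like the one described, the same shuffling scheme could in principle be applied to any association measure, for example a variable importance score from RF or BRT, at the cost of refitting the model on each permuted dataset.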

RESULTS

Between 3 and 24 covariates (1%-8%) were identified as associated with influenza infection, depending on the method. The pre-seasonal haemagglutination inhibition antibody titer was the only covariate selected by all methods, while 266 (87%) covariates were not selected by any method. At a 5% nominal significance level, the TPR was 85% with RF, 80% with BRT, 26% to 49% with UFMLR, and 71% to 78% with LASSO. Conversely, the FPR was 4% with RF and BRT, 9% to 2% with UFMLR, and 9% to 4% with LASSO.
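As a rough illustration of how TPR and FPR can be estimated from simulated datasets in which the truly associated covariates are known, here is a minimal sketch. The function names and the toy inputs are hypothetical, and the abstract does not state how the authors defined or averaged these rates, so per-dataset averaging is an assumption.

```python
def selection_rates(selected, truly_associated, n_covariates):
    """Return (TPR, FPR) for one dataset, given the covariates a method selected."""
    true_set = set(truly_associated)
    sel = set(selected)
    true_positives = len(sel & true_set)   # truly associated and selected
    false_positives = len(sel - true_set)  # selected but not truly associated
    n_null = n_covariates - len(true_set)  # covariates with no true association
    return true_positives / len(true_set), false_positives / n_null


def mean_rates(runs, truly_associated, n_covariates):
    """Average TPR and FPR over several simulated datasets (assumed averaging scheme)."""
    rates = [selection_rates(s, truly_associated, n_covariates) for s in runs]
    tpr = sum(r[0] for r in rates) / len(rates)
    fpr = sum(r[1] for r in rates) / len(rates)
    return tpr, fpr


# Toy example (made up): 10 covariates, covariates 0 and 1 truly associated,
# and the selections returned by a method on three simulated datasets.
runs = [[0, 1, 5], [0], [1, 2, 3]]
tpr, fpr = mean_rates(runs, truly_associated=[0, 1], n_covariates=10)
```

With this definition, a method that selects every covariate reaches a TPR of 1 but also an FPR of 1, which is why the two rates are reported together.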

CONCLUSIONS

Data mining methods and LASSO should be considered as valuable methods to detect independent associations in large epidemiologic datasets.

