Mansiaux Yohann, Carrat Fabrice
INSERM, UMR_S 1136, Institut Pierre Louis d'Epidémiologie et de Santé Publique, F-75013 Paris, France.
BMC Med Res Methodol. 2014 Aug 26;14:99. doi: 10.1186/1471-2288-14-99.
Big data is steadily growing in epidemiology. We explored the performance of methods dedicated to big data analysis for detecting independent associations between exposures and a health outcome.
We searched for associations between 303 covariates and influenza infection in 498 subjects (14% infected) sampled from a dedicated cohort. Independent associations were detected using four approaches: two data mining methods, Random Forests (RF) and Boosted Regression Trees (BRT); the conventional logistic regression framework (Univariate Followed by Multivariate Logistic Regression, UFMLR); and the Least Absolute Shrinkage and Selection Operator (LASSO), which applies a penalty in multivariate logistic regression to achieve a sparse selection of covariates. We developed permutation tests to assess the statistical significance of associations. We simulated 500 similarly sized datasets to estimate the true positive rate (TPR) and false positive rate (FPR) of these methods.
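The idea behind a permutation test can be illustrated with a minimal sketch. This is not the authors' procedure, which the abstract does not detail. It assumes a single binary exposure, a binary outcome, and a simple test statistic (the absolute difference in outcome proportions between exposed and unexposed subjects). The function name `permutation_p_value` and its parameters are hypothetical.

```python
import random


def permutation_p_value(exposure, outcome, n_permutations=1000, seed=0):
    """Permutation p-value for the association between a binary exposure
    and a binary outcome (illustrative statistic, not the paper's)."""
    rng = random.Random(seed)

    def statistic(x, y):
        # Absolute difference in outcome proportion, exposed vs unexposed.
        exposed = [yi for xi, yi in zip(x, y) if xi == 1]
        unexposed = [yi for xi, yi in zip(x, y) if xi == 0]
        return abs(sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed))

    observed = statistic(exposure, outcome)
    shuffled = list(outcome)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling the outcome breaks any link with the exposure,
        # which yields one draw from the null distribution.
        rng.shuffle(shuffled)
        if statistic(exposure, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)
```

With tree-based methods, the same scheme would presumably be applied to a variable importance score rather than to a difference in proportions, but that is an assumption about the general approach, not a description of the published method.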
Depending on the method, between 3 and 24 covariates (1%-8%) were identified as associated with influenza infection. The pre-seasonal haemagglutination inhibition antibody titer was the only covariate selected by all methods, while 266 covariates (87%) were not selected by any method. At the 5% nominal significance level, the TPR was 85% with RF, 80% with BRT, 26% to 49% with UFMLR, and 71% to 78% with LASSO. The FPR was 4% with RF and BRT, 9% to 2% with UFMLR, and 9% to 4% with LASSO.
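As a rough illustration of how such rates can be computed on one simulated dataset, the sketch below assumes that the set of truly associated covariates is known by construction, and that TPR and FPR are the fractions of truly associated and of null covariates that a method selects. The abstract does not give the exact definitions used, and the function name `positive_rates` is hypothetical.

```python
def positive_rates(selected, truly_associated, all_covariates):
    """Return (TPR, FPR) for one simulated dataset.

    Assumes the truly associated covariates are known, as they would be
    when the data are simulated.
    """
    selected = set(selected)
    truth = set(truly_associated)
    null = set(all_covariates) - truth
    # TPR: share of truly associated covariates that were selected.
    tpr = len(selected & truth) / len(truth)
    # FPR: share of null covariates that were selected.
    fpr = len(selected & null) / len(null)
    return tpr, fpr
```

Averaging these two values over all simulated datasets would then give one TPR and one FPR estimate per method.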
Data mining methods and LASSO should be considered valuable methods for detecting independent associations in large epidemiologic datasets.