From the Department of Epidemiology, Emory University, Atlanta, GA.
Department of Epidemiology, University of Pittsburgh School of Public Health, Pittsburgh, PA.
Epidemiology. 2024 Nov 1;35(6):779-786. doi: 10.1097/EDE.0000000000001785. Epub 2024 Aug 16.
The use of machine learning to estimate exposure effects introduces a dependence between the results of an empirical study and the value of the seed used to fix the pseudo-random number generator.
We used data from 10,038 pregnant women and a 10% subsample (N = 1004) to examine the extent to which the risk difference for the relation between fruit and vegetable consumption and preeclampsia risk changes under different seed values. We fit an augmented inverse probability weighted estimator with two Super Learner algorithms: a simple algorithm including random forests and single-layer neural networks and a more complex algorithm with a mix of tree-based, regression-based, penalized, and simple algorithms. We evaluated the distributions of risk differences, standard errors, and P values that result from 5000 different seed value selections.
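The seed-sweep design described above can be illustrated with a minimal sketch. It is not the authors' code and rests on several assumptions: the data are simulated (a hypothetical binary confounder `x`, exposure `a`, and outcome `y`), the Super Learner libraries are replaced by smoothed stratified means, and the only seed-dependent step is a two-fold cross-fitting split. The function names `simulate`, `fit_nuisance`, `aipw_rd`, and `seed_sweep` are illustrative only. The point is the shape of the experiment: hold the data fixed, vary the seed, and summarize the resulting distribution of risk differences.

```python
import random
import statistics


def simulate(n, data_seed=0):
    """Hypothetical data: binary confounder x, exposure a, outcome y."""
    rng = random.Random(data_seed)
    rows = []
    for _ in range(n):
        x = int(rng.random() < 0.5)
        a = int(rng.random() < (0.3 + 0.4 * x))
        y = int(rng.random() < (0.10 + 0.05 * x - 0.03 * a))
        rows.append((x, a, y))
    return rows


def fit_nuisance(train):
    """Smoothed stratified means for P(A=1|X) and P(Y=1|A,X).

    Assumption: these stand in for the Super Learner fits in the study.
    """
    ps, mu = {}, {}
    for x in (0, 1):
        in_x = [r for r in train if r[0] == x]
        ps[x] = (sum(r[1] for r in in_x) + 0.5) / (len(in_x) + 1)
        for a in (0, 1):
            cell = [r[2] for r in in_x if r[1] == a]
            mu[(a, x)] = (sum(cell) + 0.5) / (len(cell) + 1)
    return ps, mu


def aipw_rd(rows, seed, k=2):
    """Cross-fitted AIPW risk difference; the seed fixes the fold split."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    phi = []  # per-observation AIPW contributions
    for fold in folds:
        held = set(fold)
        train = [rows[i] for i in idx if i not in held]
        ps, mu = fit_nuisance(train)
        for i in fold:
            x, a, y = rows[i]
            m1, m0, e = mu[(1, x)], mu[(0, x)], ps[x]
            phi.append(m1 - m0 + a * (y - m1) / e - (1 - a) * (y - m0) / (1 - e))
    rd = statistics.fmean(phi)
    se = statistics.stdev(phi) / len(phi) ** 0.5
    return rd, se


def seed_sweep(rows, n_seeds=200):
    """Re-estimate the risk difference on fixed data under many seeds."""
    rds = [aipw_rd(rows, seed)[0] for seed in range(n_seeds)]
    q1, med, q3 = statistics.quantiles(rds, n=4)
    return rds, med, q3 - q1


rows = simulate(1000)
rds, median_rd, iqr_width = seed_sweep(rows)
print(f"median RD per 1000: {1000 * median_rd:.1f}")
print(f"IQR width per 1000: {1000 * iqr_width:.1f}")
```

With learners this simple, the spread across seeds is expected to be small. Swapping in stochastic learners such as random forests or neural networks would add further seed-dependent steps beyond the fold split.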
Our findings suggest important variability in the risk difference estimates across seed values, as well as an important effect of the stacking algorithm used. The interquartile range width of the risk differences in the full sample with the simple algorithm was 13 per 1000. However, all other interquartile ranges were roughly an order of magnitude narrower. The medians of the distributions of risk differences differed according to the sample size and the algorithm used.
Our findings add another dimension of concern regarding the potential for "p-hacking," and reinforce the need to move away from simplistic evidentiary thresholds in empirical research. When empirical results depend on pseudo-random number generator seed values, caution is warranted in interpreting these results.