Faculty of Statistics, Institute of Mathematical Statistics and Applications in Industry, Technical University of Dortmund, Dortmund 44227, Germany.
Bioinformatics. 2020 May 1;36(10):3099-3106. doi: 10.1093/bioinformatics/btaa082.
Imputation procedures have become standard statistical practice in biomedical fields, since they allow further analyses to be conducted as if no values had been missing. In particular, non-parametric imputation schemes such as the random forest have shown favorable imputation performance compared with the more traditionally used MICE procedure. However, their effect on valid statistical inference has not been analyzed so far. This article closes this gap by investigating their validity for inferring mean differences in incompletely observed pairs and comparing them with a recent approach that works only with the observations at hand.
Our findings indicate that machine-learning schemes for (multiply) imputing missing values may inflate the type I error or result in comparatively low power in small-to-moderate samples of matched pairs, even after modifying the test statistics using Rubin's multiple imputation rule. In addition to an extensive simulation study, we consider an illustrative data example from a breast cancer gene study.
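For readers unfamiliar with Rubin's multiple imputation rule mentioned above, the following is a minimal sketch of how estimates of a paired mean difference from several imputed data sets are commonly pooled. It is not the authors' R code. The function names `paired_estimate` and `rubin_pool` and all numbers in the usage example are hypothetical, and the degrees of freedom follow Rubin's classical formula, which may differ from the variant used in the article.

```python
import math
from statistics import mean, variance


def paired_estimate(x, y):
    """Mean difference of completed pairs and its estimated variance.

    x and y are equal-length sequences from ONE imputed (completed) data set.
    """
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs), variance(diffs) / len(diffs)


def rubin_pool(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules.

    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = mean(estimates)               # pooled point estimate
    u_bar = mean(variances)               # within-imputation variance
    b = variance(estimates)               # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b       # total variance
    if b > 0:
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    else:
        df = math.inf                     # no between-imputation variability
    return q_bar, total, df


if __name__ == "__main__":
    # Hypothetical per-imputation results, for illustration only.
    q, t, df = rubin_pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    t_stat = q / math.sqrt(t)             # compared against a t distribution with df
    print(q, t, df, t_stat)
```

In this sketch the pooled statistic is referred to a t distribution with the returned degrees of freedom; the abstract's point is that such a test can still be too liberal or lose power when the imputations come from machine-learning schemes.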
The corresponding R code is available from the authors, and the gene expression data can be downloaded at www.gdac.broadinstitute.org.
Supplementary data are available at Bioinformatics online.