Department of Statistics, TU Dortmund University, Dortmund, North Rhine-Westphalia, Germany.
Division of Biostatistics, German Cancer Research Center, Heidelberg, Baden-Wuerttemberg, Germany.
PLoS One. 2024 May 15;19(5):e0299989. doi: 10.1371/journal.pone.0299989. eCollection 2024.
Simulation is a crucial tool for the evaluation and comparison of statistical methods. How to design fair and neutral simulation studies is therefore of great interest for both researchers developing new methods and practitioners confronted with the choice of the most suitable method. The term simulation usually refers to parametric simulation, that is, computer experiments using artificial data made up of pseudo-random numbers. Plasmode simulation, that is, computer experiments using the combination of resampling feature data from a real-life dataset and generating the target variable with a known user-selected outcome-generating model, is an alternative that is often claimed to produce more realistic data. We compare parametric and Plasmode simulation for the example of estimating the mean squared error (MSE) of the least squares estimator (LSE) in linear regression. If the true underlying data-generating process (DGP) and the outcome-generating model (OGM) were known, parametric simulation would obviously be the best choice in terms of estimating the MSE well. However, in reality, both are usually unknown, so researchers have to make assumptions: in Plasmode simulation studies for the OGM, in parametric simulation for both DGP and OGM. Most likely, these assumptions do not exactly reflect the truth. Here, we aim to find out how assumptions deviating from the true DGP and the true OGM affect the performance of parametric and Plasmode simulations in the context of MSE estimation for the LSE and in which situations which simulation type is preferable. Our results suggest that the preferable simulation method depends on many factors, including the number of features, and on how and to what extent the assumptions of a parametric simulation differ from the true DGP. Also, the resampling strategy used for Plasmode influences the results. In particular, subsampling with a small sampling proportion can be recommended.
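To make the two simulation types concrete, the following is a minimal sketch and not the authors' implementation. It assumes the MSE of the LSE is the expected squared Euclidean distance between the estimated and true coefficient vectors, that the parametric DGP draws i.i.d. standard normal features, and that the OGM is a linear model with Gaussian noise. The function names (`lse`, `outcome`, `parametric_mse`, `plasmode_mse`) and all parameter values are hypothetical, and the "real" feature matrix is a synthetic stand-in, since the abstract names no dataset.

```python
import numpy as np


def lse(X, y):
    """Least squares estimate for a linear model with an intercept."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta_hat


def outcome(X, beta, sigma, rng):
    """Assumed OGM: y = beta_0 + X @ beta_1..p + Gaussian noise."""
    return beta[0] + X @ beta[1:] + rng.normal(0.0, sigma, size=len(X))


def parametric_mse(n, beta, sigma, n_rep, rng):
    """Parametric simulation: features AND outcome are generated artificially.

    Assumed DGP: i.i.d. standard normal features.
    """
    p = len(beta) - 1
    errs = np.empty(n_rep)
    for r in range(n_rep):
        X = rng.standard_normal((n, p))
        y = outcome(X, beta, sigma, rng)
        errs[r] = np.sum((lse(X, y) - beta) ** 2)
    return errs.mean()


def plasmode_mse(X_real, m, beta, sigma, n_rep, rng):
    """Plasmode simulation: features are resampled from a real dataset and
    only the outcome is generated from the assumed OGM.

    Resampling here is subsampling, i.e. m rows drawn without replacement.
    """
    n = len(X_real)
    errs = np.empty(n_rep)
    for r in range(n_rep):
        idx = rng.choice(n, size=m, replace=False)
        X = X_real[idx]
        y = outcome(X, beta, sigma, rng)
        errs[r] = np.sum((lse(X, y) - beta) ** 2)
    return errs.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    beta = np.array([1.0, 0.5, -0.5, 2.0])     # hypothetical true coefficients
    X_real = rng.lognormal(size=(500, 3))      # synthetic stand-in for real features
    print("parametric:", parametric_mse(50, beta, 1.0, 200, rng))
    print("plasmode:  ", plasmode_mse(X_real, 50, beta, 1.0, 200, rng))
```

In this sketch the stand-in features are log-normal while the parametric DGP assumes normality, so the two estimates differ. That gap illustrates the kind of deviation between assumed and true DGP that the study examines. The subsampling proportion is `m / len(X_real)`, here 0.1.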