Torres David J, Rouson Damian
Department of Mathematics and Physical Science, Northern New Mexico College, Española, NM 87532, USA.
Computer Languages and Systems Software Group, Lawrence Berkeley National Laboratory, Berkeley, California, USA.
Monte Carlo Methods Appl. 2024 Aug 8;30(4):331-363. doi: 10.1515/mcma-2024-2013. eCollection 2024 Dec.
Correlation coefficients and linear regression values computed from group averages can differ from those computed using individual scores. This observation, known as the ecological fallacy, often assumes that all individual scores from a population are available. In many situations, however, one must use a sample from the larger population. In such cases, the computed correlation coefficient and linear regression values depend on the sample that is chosen and on the underlying sampling distribution. For normally distributed variables and random samples drawn from infinitely large continuous distributions, the sampling distribution of correlation coefficients and linear regression values for group averages is identical to the sampling distribution for individuals. In practice, however, data are often acquired by sampling without replacement from a finite population. Our objective is to demonstrate through Monte Carlo simulations that the sampling distributions for correlation and linear regression are also similar for individuals and group averages when sampling without replacement from normally distributed variables. These simulations suggest that, when a random sample is selected from a population, correlation coefficients and linear regression values computed from individual scores are no more accurate in estimating the population values than those computed from group averages, as long as the sample size is the same.
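The comparison described in the abstract can be illustrated with a small Monte Carlo sketch. It is not the authors' code, and every name and parameter in it (`run`, `rho`, `group_size`, the population and sample sizes) is a hypothetical choice made for illustration. It assumes a finite population of bivariate normal scores, groups formed from consecutive blocks of independently generated individuals (which amounts to random grouping), and samples of equal size drawn without replacement from the individuals and from the group averages.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def make_population(n_individuals, rho, rng):
    """Finite population of (x, y) pairs drawn from a bivariate normal
    with unit variances and correlation rho (an assumed setup)."""
    population = []
    for _ in range(n_individuals):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        population.append((x, y))
    return population


def group_averages(population, group_size):
    """Average consecutive blocks of individuals into group means.
    Individuals are generated independently, so this is a random grouping."""
    averages = []
    for start in range(0, len(population) - group_size + 1, group_size):
        block = population[start:start + group_size]
        mean_x = sum(p[0] for p in block) / group_size
        mean_y = sum(p[1] for p in block) / group_size
        averages.append((mean_x, mean_y))
    return averages


def sampling_distribution(units, sample_size, n_trials, rng):
    """Correlation coefficients from repeated samples without replacement."""
    coefficients = []
    for _ in range(n_trials):
        sample = rng.sample(units, sample_size)  # without replacement
        xs, ys = zip(*sample)
        coefficients.append(pearson_r(xs, ys))
    return coefficients


def summarize(values):
    """Mean and sample standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, math.sqrt(var)


def run(seed=0, rho=0.5, n_individuals=5000, group_size=5,
        sample_size=30, n_trials=2000):
    """Return (mean, sd) of r for individuals and for group averages."""
    rng = random.Random(seed)
    population = make_population(n_individuals, rho, rng)
    groups = group_averages(population, group_size)
    individual_stats = summarize(
        sampling_distribution(population, sample_size, n_trials, rng))
    group_stats = summarize(
        sampling_distribution(groups, sample_size, n_trials, rng))
    return individual_stats, group_stats


if __name__ == "__main__":
    individual_stats, group_stats = run()
    print("individuals   : mean r = %.3f, sd = %.3f" % individual_stats)
    print("group averages: mean r = %.3f, sd = %.3f" % group_stats)
```

Under these assumptions the two summaries should come out close to each other, in line with the abstract's claim. In this sketch the group-average population is smaller than the individual population by a factor of `group_size`, so the sampling fractions differ slightly. Other grouping schemes, such as groups that differ systematically, are outside what this sketch covers.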