Department of Chemistry, Hong Kong University of Science & Technology, Clear Water Bay, Kowloon, Hong Kong, China.
Texas A&M Transportation Institute, Texas A&M University, College Station, TX 77843-3135, United States.
Sci Total Environ. 2021 May 1;767:144282. doi: 10.1016/j.scitotenv.2020.144282. Epub 2020 Dec 24.
The homoscedasticity assumption (the variance of the error term is the same across all observations) is a key assumption in the ordinary least squares (OLS) solution of a linear regression model. The validity of this assumption is examined for a multiple linear regression model used to determine the source contributions to the observed black carbon concentrations at 12 background monitoring sites across China using a hybrid modeling approach. Residual analysis from the traditional OLS method, which assumes that the error term is additive and normally distributed with a mean of zero, shows pronounced heteroscedasticity in 11 of the 12 datasets according to the Breusch-Pagan test. Noting that the atmospheric black carbon data are log-normally distributed, we make a new assumption that the error terms are multiplicative and log-normally distributed. When the coefficients of the multiple linear regression model are determined by maximum likelihood estimation (MLE), the distribution of the residuals in 8 of the 12 datasets is in good accordance with the revised assumption. Furthermore, the MLE computation under this new assumption can be proved mathematically identical to minimizing a log-scale objective function, which considerably reduces the complexity of the MLE calculation. The new method is further demonstrated to have clear advantages in numerical simulation experiments of a 5-variable multiple linear regression model using synthesized data with prescribed coefficients and log-normally distributed multiplicative errors. Under all 9 simulation scenarios, the new method yields the most accurate estimates of the regression coefficients and has a significantly higher coverage probability (on average, 95% for all five coefficients) than the OLS (79%) and weighted least squares (WLS, 72%) methods.
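The equivalence between the MLE and a log-scale objective can be illustrated with a short sketch. It is not the authors' code, and it rests on stated assumptions: the model is written as y_i = (x_i · β) ε_i with log ε_i normally distributed with zero mean and constant variance, so the β-dependent part of the negative log-likelihood reduces to Σ (log y_i − log(x_i · β))². The function name `fit_log_scale`, the synthetic predictors, the coefficient values, and the noise level are all hypothetical choices for illustration.

```python
import numpy as np
from scipy.optimize import least_squares


def fit_log_scale(X, y):
    """Estimate beta by minimizing sum((log y - log(X @ beta))**2).

    Assumes positive responses y and non-negative predictors X, so that
    X @ beta stays positive for positive beta (an illustrative restriction).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Start from the OLS solution, clipped so the start is strictly positive.
    beta0 = np.clip(np.linalg.lstsq(X, y, rcond=None)[0], 1e-6, None)

    def residuals(beta):
        # Log-scale residuals; their sum of squares is the objective.
        return np.log(y) - np.log(X @ beta)

    result = least_squares(residuals, beta0, bounds=(1e-12, np.inf))
    return result.x


# Synthetic data with prescribed coefficients and multiplicative
# log-normal errors (all values below are hypothetical).
rng = np.random.default_rng(0)
n = 5000
beta_true = np.array([1.0, 2.0, 0.5, 1.5, 3.0])
X = rng.uniform(0.1, 5.0, size=(n, 5))
eps = np.exp(rng.normal(0.0, 0.1, size=n))  # log(eps) ~ N(0, 0.1**2)
y = (X @ beta_true) * eps

beta_hat = fit_log_scale(X, y)                    # log-scale (MLE) fit
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # additive-error OLS fit
print("log-scale fit:", beta_hat)
print("OLS fit:      ", beta_ols)
```

Under these assumptions the fit needs only a nonlinear least-squares solver on the log scale, rather than a general-purpose likelihood maximizer. The positivity bound on the coefficients is a convenience of the sketch, chosen so that the logarithm is always defined.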