Kamath Asha, Poojari Satyanarayana, Varsha K
Department of Applied Statistics and Data Science, Prasanna School of Public Health, Manipal Academy of Higher Education, Manipal, Udupi, Karnataka, India.
BMC Med Res Methodol. 2025 Sep 1;25(1):206. doi: 10.1186/s12874-025-02641-y.
Many statistical methods used in public health research, such as t-tests, ANOVA, correlation, and regression, rely on the assumption of normality. Violating this assumption can lead to severely biased parameter estimates and reduced test power, undermining the reliability and validity of the findings and the real-world evidence drawn from them. This article attempts to provide guidelines for choosing appropriate normality tests in public health data analytics.
This study compares the performance of 13 normality tests commonly available in statistical software: Shapiro-Wilk and Shapiro-Francia (regression-based tests); Lilliefors, Cramér-von Mises, and Anderson-Darling (empirical distribution-based tests); Jarque-Bera, Adjusted Jarque-Bera, Robust Jarque-Bera, D'Agostino & Pearson, D'Agostino Skewness, D'Agostino Kurtosis, and Gel-Miao-Gastwirth (moment-based tests); and Pearson Chi-Square (chi-square-based test). The tests were evaluated on empirical Type I error and power across varying sample sizes, skewness, and kurtosis using Monte Carlo simulations, with non-normal data generated via the Fleishman method to reflect slight to substantial departures from normality in skewness and kurtosis.
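The simulation design described above can be illustrated with a minimal sketch. It is not the authors' code: it assumes the standard Fleishman power transformation Y = -c + bZ + cZ^2 + dZ^3 with Z standard normal, solves for the coefficients with a simple Newton iteration, and evaluates only the classical Jarque-Bera test against its asymptotic chi-square critical value. The function names, sample size, replication count, and seed are all hypothetical choices made for illustration.

```python
import math
import random


def fleishman_residuals(b, c, d, skew, exkurt):
    """Fleishman's moment equations (unit variance, target skewness, excess kurtosis)."""
    return [
        b * b + 6 * b * d + 2 * c * c + 15 * d * d - 1,
        2 * c * (b * b + 24 * b * d + 105 * d * d + 2) - skew,
        24 * (b * d + c * c * (1 + b * b + 28 * b * d)
              + d * d * (12 + 48 * b * d + 141 * c * c + 225 * d * d)) - exkurt,
    ]


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fleishman_coefficients(skew, exkurt, iters=100, h=1e-7):
    """Newton iteration from (b, c, d) = (1, 0, 0) with a finite-difference Jacobian."""
    x = [1.0, 0.0, 0.0]
    for _ in range(iters):
        f = fleishman_residuals(*x, skew, exkurt)
        if max(abs(v) for v in f) < 1e-12:
            break
        # jac[i][j] approximates the derivative of equation i with respect to x[j].
        jac = [[0.0] * 3 for _ in range(3)]
        for j in range(3):
            xh = list(x)
            xh[j] += h
            fh = fleishman_residuals(*xh, skew, exkurt)
            for i in range(3):
                jac[i][j] = (fh[i] - f[i]) / h
        det = det3(jac)
        # Solve jac * step = -f by Cramer's rule and apply the step.
        for j in range(3):
            mj = [row[:] for row in jac]
            for i in range(3):
                mj[i][j] = -f[i]
            x[j] += det3(mj) / det
    return tuple(x)


def jarque_bera_rejects(xs, alpha=0.05):
    """Classical Jarque-Bera test using the asymptotic chi-square(2) critical value."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((v - mean) ** 2 for v in xs) / n
    m3 = sum((v - mean) ** 3 for v in xs) / n
    m4 = sum((v - mean) ** 4 for v in xs) / n
    s = m3 / m2 ** 1.5
    k = m4 / m2 ** 2 - 3
    jb = n / 6 * (s * s + k * k / 4)
    # The chi-square(2) upper-alpha quantile has the closed form -2 * ln(alpha).
    return jb > -2 * math.log(alpha)


def rejection_rate(skew, exkurt, n, reps, seed):
    """Share of simulated samples in which normality is rejected."""
    rng = random.Random(seed)
    b, c, d = fleishman_coefficients(skew, exkurt)
    rejected = 0
    for _ in range(reps):
        zs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [-c + b * z + c * z * z + d * z ** 3 for z in zs]
        rejected += jarque_bera_rejects(ys)
    return rejected / reps


# Skewness 0 and excess kurtosis 0 reproduce normal data, giving empirical Type I error.
type1 = rejection_rate(0.0, 0.0, n=100, reps=500, seed=1)
# A skewed, heavier-tailed target gives empirical power.
power = rejection_rate(1.0, 1.5, n=100, reps=500, seed=1)
print(f"empirical Type I error: {type1:.3f}, empirical power: {power:.3f}")
```

Extending the sketch to the other tests would mean swapping `jarque_bera_rejects` for each test's decision rule and repeating the loop over a grid of sample sizes, skewness, and kurtosis values.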
For moderately skewed data with low kurtosis, the D'Agostino Skewness and Shapiro-Wilk tests perform better across all sample sizes, while the Robust Jarque-Bera (RJB) and Adjusted Jarque-Bera tests are preferable at higher kurtosis. For highly skewed data, Shapiro-Wilk is the most effective, with Shapiro-Francia and Anderson-Darling improving as sample size increases. For symmetric data, RJB and Gel-Miao-Gastwirth (GMG) are robust choices, with GMG preferred at higher kurtosis. Findings from two real-world datasets support the simulation results.
The performance of normality tests is significantly influenced by sample size, skewness, and kurtosis. The findings of this study contribute to improving statistical practice in public health research by providing a practical, evidence-based checklist for selecting appropriate normality tests based on these key sample characteristics.
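As a rough illustration, the recommendations reported in this abstract can be written as a small lookup. This is only a sketch: the abstract gives no numeric cut-offs, so the categorical labels (`"symmetric"`, `"moderate"`, `"high"` for skewness; `"low"`, `"high"` for kurtosis), the `large_sample` flag, and the function name are hypothetical, and the full checklist in the article may differ in detail.

```python
from typing import List


def recommend_tests(skewness: str, kurtosis: str = "low",
                    large_sample: bool = False) -> List[str]:
    """Return the tests the abstract favours for a given data shape (labels are assumed)."""
    if skewness == "symmetric":
        # RJB and GMG are both reported as robust; GMG is preferred at higher kurtosis.
        if kurtosis == "high":
            return ["Gel-Miao-Gastwirth", "Robust Jarque-Bera"]
        return ["Robust Jarque-Bera", "Gel-Miao-Gastwirth"]
    if skewness == "moderate":
        if kurtosis == "high":
            return ["Robust Jarque-Bera", "Adjusted Jarque-Bera"]
        return ["D'Agostino Skewness", "Shapiro-Wilk"]
    if skewness == "high":
        # Shapiro-Francia and Anderson-Darling are reported to improve with larger samples.
        tests = ["Shapiro-Wilk"]
        if large_sample:
            tests += ["Shapiro-Francia", "Anderson-Darling"]
        return tests
    raise ValueError(f"unknown skewness label: {skewness!r}")


print(recommend_tests("moderate", "high"))
```

Keeping the inputs categorical avoids inventing thresholds that the abstract does not state; an analyst would need the article's own definitions to map observed skewness and kurtosis onto these labels.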