Gasko R
Vzajomna zdravotna poistovna Dovera, Kosice, Slovakia.
Bratisl Lek Listy. 2003;104(1):36-9.
In statistical hypothesis testing, it is generally accepted that exact p values should be reported in the paper. Researchers can choose among many statistical software packages that implement hypothesis tests. Do different packages return identical exact p values for the same statistical test?
The respective null hypotheses were tested in 5 artificially created data sets using the parametric unpaired t-test, the non-parametric Mann-Whitney test, and the two-tailed F-test. The calculations were carried out with the following programs: Statistix, version 7.1 (source www.statistix.com), Analyse-it, version 1.62 (source www.analyse-it.com), and MedCalc, version 6.14 (source www.medcalc.be). The p values obtained for the same test were compared across programs.
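The Mann-Whitney test named above can be computed in more than one way, which is one plausible reason why programs might disagree. The abstract does not state the cause of the discrepancies, so this is an assumption. The minimal Python sketch below uses only the standard library and compares an exact permutation p value with two normal approximations. The data and all function names are hypothetical and do not come from the study.

```python
from functools import lru_cache
from math import comb, erfc, sqrt


def u_statistic(x, y):
    """U for sample x: number of pairs (xi, yj) with xi > yj (ties count 0.5)."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


@lru_cache(maxsize=None)
def count(m, n, u):
    """Number of orderings of m x-values and n y-values with U == u (no ties)."""
    if u < 0:
        return 0
    if m == 0 or n == 0:
        return 1 if u == 0 else 0
    # The largest observation is either an x (adds n to U) or a y (adds 0).
    return count(m - 1, n, u - n) + count(m, n - 1, u)


def exact_p(u, m, n):
    """Two-sided exact p value; valid only when the data contain no ties."""
    u = int(u)
    total = comb(m + n, m)
    p_le = sum(count(m, n, k) for k in range(0, u + 1)) / total
    p_ge = sum(count(m, n, k) for k in range(u, m * n + 1)) / total
    return min(1.0, 2.0 * min(p_le, p_ge))


def normal_p(u, m, n, continuity=True):
    """Two-sided p value from the normal approximation (no tie correction)."""
    mu = m * n / 2.0
    sigma = sqrt(m * n * (m + n + 1) / 12.0)
    dist = abs(u - mu) - (0.5 if continuity else 0.0)
    z = max(dist, 0.0) / sigma
    return erfc(z / sqrt(2.0))


# Hypothetical data, chosen only for illustration.
x = [1.1, 2.3, 2.9, 3.8, 4.0]
y = [3.1, 4.4, 5.2, 6.0, 6.7]
u = u_statistic(x, y)
print("U =", u)
print("exact p               =", exact_p(u, len(x), len(y)))
print("normal, corrected p   =", normal_p(u, len(x), len(y), continuity=True))
print("normal, uncorrected p =", normal_p(u, len(x), len(y), continuity=False))
```

The three p values differ for the same data and the same nominal test. With small samples, the choice between an exact and an approximate method can therefore move a result across a fixed threshold.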
All three programs calculated identical exact p values for the t-test. In the remaining two tests, 26 of 44 calculations (59.1%; 95% confidence interval 43-73%) yielded different p values. The greatest difference was 18.35%. In two cases the values fluctuated around 0.05, which led to essentially different interpretations of the results.
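The reported proportion, 26 of 44, and its 95% confidence interval can be recomputed as a rough check. The abstract does not say which interval method was used, so the sketch below is an assumption. It uses only the Python standard library and computes three common variants: Clopper-Pearson (exact), Wilson, and Wald. All function names are hypothetical.

```python
from math import comb, sqrt

Z95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect(f, lo, hi, tol=1e-10):
    """Root of a monotonic function f that changes sign on [lo, hi]."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact interval obtained by inverting the two binomial tail probabilities."""
    lower = 0.0 if k == 0 else bisect(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2, 0.0, 1.0)
    upper = 1.0 if k == n else bisect(
        lambda p: binom_cdf(k, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper


def wilson(k, n, z=Z95):
    """Wilson score interval."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def wald(k, n, z=Z95):
    """Simple normal-approximation (Wald) interval."""
    p = k / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


k, n = 26, 44  # counts taken from the abstract
print(f"proportion = {k / n:.3f}")
for name, method in [("Clopper-Pearson", clopper_pearson),
                     ("Wilson", wilson), ("Wald", wald)]:
    lo, hi = method(k, n)
    print(f"{name:16s} {lo:.3f} to {hi:.3f}")
```

The three methods give slightly different bounds for the same counts. Interval estimates, like p values, depend on the calculation a given program implements.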
The use of significance tests in biomedical research has long been criticized. Testing the null hypothesis at the arbitrary significance level of 0.05 should be replaced by other methods. Our findings should undermine the unfounded belief held by the users of statistical tests (physicians) in the unassailable accuracy of mathematical procedures. The use of confidence intervals seems much more suitable, although there are objections to them as well. (Tab. 4, Fig. 1, Ref. 19.)