[Data transformation and normalization].

Author information

Nishida Toshinobu

Affiliation

Biomedical Laboratory Sciences, Institute of Health Biosciences, The University of Tokushima Graduate School, Tokushima 770-8509, Japan.

Publication information

Rinsho Byori. 2010 Oct;58(10):990-7.

PMID: 21077289

Abstract

When we analyze measured values with statistical techniques, we usually use parametric methods, which assume that the values follow a normal distribution. We must therefore confirm normality, for example by checking that a histogram of the data points is symmetrical about the mean. When measured values are not normally distributed, a power transformation (such as a square root or logarithmic transformation) must be applied. Normality can be evaluated in three ways: by inspecting a histogram, by examining the skewness and kurtosis values, and by the p-value of the Kolmogorov-Smirnov test. Because the Kolmogorov-Smirnov test is influenced by outliers, its p-values must be interpreted with care. For example, when we calculate reference intervals from clinical testing data, we estimate the parameters (mean and standard deviation) and set the mean ± 2SD as the upper and lower limits. When evaluating reference intervals, what matters is the range that covers the central 95% of the sample. In my experience, identifying the distribution type from the diagonal linear pattern of a normal quantile plot may be the most reliable approach.
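The checks described in the abstract can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the author's procedure: the data are synthetic log-normal values rather than clinical measurements, the function names (`skewness`, `excess_kurtosis`, `qq_correlation`, `reference_interval`) are hypothetical, and the plotting positions `(i + 0.5) / n` used for the normal quantile plot are one common convention among several. The correlation between the sorted data and the theoretical normal quantiles stands in numerically for the "diagonal linear pattern" of the plot: the closer it is to 1, the straighter the line.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def _central_moment(values, order):
    """Population central moment of the given order."""
    m = mean(values)
    return sum((v - m) ** order for v in values) / len(values)


def skewness(values):
    """Third standardized moment; close to 0 for symmetric data."""
    return _central_moment(values, 3) / _central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; close to 0 for normal data."""
    return _central_moment(values, 4) / _central_moment(values, 2) ** 2 - 3.0


def qq_correlation(values):
    """Correlation between sorted data and theoretical normal quantiles."""
    n = len(values)
    ordered = sorted(values)
    std_normal = NormalDist()
    # Assumed plotting positions: (i + 0.5) / n for i = 0 .. n-1.
    theo = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, my = mean(theo), mean(ordered)
    sxy = sum((x - mx) * (y - my) for x, y in zip(theo, ordered))
    sxx = sum((x - mx) ** 2 for x in theo)
    syy = sum((y - my) ** 2 for y in ordered)
    return sxy / math.sqrt(sxx * syy)


def reference_interval(values, transform=None, inverse=None):
    """Mean +/- 2SD limits, computed on the transformed scale and mapped back."""
    data = [transform(v) for v in values] if transform else list(values)
    m, s = mean(data), stdev(data)
    lower, upper = m - 2 * s, m + 2 * s
    if inverse:
        lower, upper = inverse(lower), inverse(upper)
    return lower, upper


# Synthetic right-skewed "measurements" (log-normal); not real clinical data.
rng = random.Random(0)
raw = [rng.lognormvariate(1.0, 0.5) for _ in range(500)]
logged = [math.log(v) for v in raw]

for label, data in (("raw", raw), ("log", logged)):
    print(
        label,
        "skewness:", round(skewness(data), 2),
        "excess kurtosis:", round(excess_kurtosis(data), 2),
        "Q-Q correlation:", round(qq_correlation(data), 4),
    )

print("mean +/- 2SD on raw scale:   ", reference_interval(raw))
print("mean +/- 2SD after log, back:", reference_interval(raw, math.log, math.exp))
```

For skewed data like this, limits computed directly on the raw scale can be badly placed (the lower limit may even fall below zero for a strictly positive quantity), whereas limits computed on the log scale and transformed back are always positive. This is the practical reason for confirming normality before setting the mean ± 2SD as reference limits.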
