Hazra Avijit, Gogtay Nithya
Department of Pharmacology, Institute of Postgraduate Medical Education and Research, Kolkata, West Bengal, India.
Department of Clinical Pharmacology, Seth GS Medical College and KEM Hospital, Mumbai, Maharashtra, India.
Indian J Dermatol. 2016 Nov-Dec;61(6):593-601. doi: 10.4103/0019-5154.193662.
Correlation and linear regression are the most commonly used techniques for quantifying the association between two numeric variables. Correlation quantifies the strength of the linear relationship between paired variables, expressing this as a correlation coefficient. If both variables x and y are normally distributed, we calculate Pearson's correlation coefficient (r). If the normality assumption is not met for one or both variables in a correlation analysis, a rank correlation coefficient, such as Spearman's rho (ρ), may be calculated. A hypothesis test of correlation tests whether the linear relationship between the two variables holds in the underlying population, in which case it returns a P < 0.05. A 95% confidence interval of the correlation coefficient can also be calculated for an idea of the correlation in the population. The r² value denotes the proportion of the variability of the dependent variable y that can be attributed to its linear relation with the independent variable x and is called the coefficient of determination. Linear regression is a technique that attempts to link two correlated variables x and y in the form of a mathematical equation (y = a + bx), such that given the value of one variable the other may be predicted. In general, the method of least squares is applied to obtain the equation of the regression line. Correlation and linear regression analysis are based on certain assumptions pertaining to the data sets. If these assumptions are not met, misleading conclusions may be drawn. The first assumption is that of a linear relationship between the two variables. A scatter plot is essential before embarking on any correlation-regression analysis to show that this is indeed the case. Outliers or clustering within data sets can distort the correlation coefficient value. Finally, it is vital to remember that though strong correlation can be a pointer toward causation, the two are not synonymous.
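As a concrete illustration of the correlation steps described above, the Python sketch below computes Pearson's r with its P value, Spearman's rho, and a 95% confidence interval for r. The paired data are hypothetical, and the Fisher z transformation used for the confidence interval is one common method, chosen here for illustration; none of these specifics come from the article itself.

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations of two numeric variables
x = np.array([1.2, 2.3, 3.1, 4.0, 4.8, 5.5, 6.7, 7.2, 8.1, 9.0])
y = np.array([2.1, 2.9, 3.8, 4.2, 5.1, 5.9, 6.4, 7.5, 7.9, 9.2])

# Pearson's r with a two-sided P value testing H0: no linear association
r, p = stats.pearsonr(x, y)

# Spearman's rho, the rank-based alternative when normality is doubtful
rho, p_rho = stats.spearmanr(x, y)

# 95% confidence interval for r via Fisher's z transformation
# (an assumed choice of method; the standard error of z is 1/sqrt(n - 3))
z = np.arctanh(r)
se = 1.0 / np.sqrt(len(x) - 3)
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)

print(f"Pearson r = {r:.3f}, P = {p:.4f}, 95% CI {lo:.3f} to {hi:.3f}")
print(f"Spearman rho = {rho:.3f}, P = {p_rho:.4f}")
```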
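For the regression side, the sketch below, continuing the same hypothetical data, fits the least-squares line y = a + bx, reports r² as the coefficient of determination, predicts y for a given x, and draws the scatter plot that should precede any correlation-regression analysis so that non-linearity, outliers, or clustering can be spotted.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([1.2, 2.3, 3.1, 4.0, 4.8, 5.5, 6.7, 7.2, 8.1, 9.0])
y = np.array([2.1, 2.9, 3.8, 4.2, 5.1, 5.9, 6.4, 7.5, 7.9, 9.2])

# linregress applies the method of least squares; rvalue**2 is the
# coefficient of determination (share of variability in y explained by x)
fit = stats.linregress(x, y)
print(f"y = {fit.intercept:.3f} + {fit.slope:.3f}x, r^2 = {fit.rvalue**2:.3f}")

# Given a value of one variable, the other may be predicted from the equation
x_new = 5.0
print(f"Predicted y at x = {x_new}: {fit.intercept + fit.slope * x_new:.3f}")

# Scatter plot with the fitted line: inspect for non-linearity, outliers,
# or clustering before trusting r or the regression equation
plt.scatter(x, y)
plt.plot(x, fit.intercept + fit.slope * x)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```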