Topliss J G, Edwards R P
J Med Chem. 1979 Oct;22(10):1238-44. doi: 10.1021/jm00196a017.
Multiple regression analysis is a basic statistical tool used for QSAR studies in drug design. However, there is a risk of arriving at fortuitous correlations when too many variables are screened relative to the number of available observations. In this regard, a critical distinction must be made between the number of variables screened for possible correlation and the number that actually appear in the regression equation. Using a modified Fortran stepwise multiple-regression analysis program, simulated QSAR studies employing random numbers were run for many different combinations of screened variables and observations. Under certain conditions, a substantial incidence of correlations with high r² values was found, although the overall degree of chance correlation noted was less than that reported in a previous study. Analysis of the results has provided a basis for making judgements concerning the level of risk of encountering chance correlations for a wide range of combinations of observations and screened variables in QSAR studies using multiple-regression analysis. For illustrative purposes, some examples involving published QSAR studies have been considered, and the reported correlations are shown to be less significant than originally presented because of the influence of unrecognized chance factors.
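The kind of simulation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Fortran program. It assumes standard-normal random numbers for both the screened variables and the activity values, and it replaces a true stepwise procedure (with F-to-enter and F-to-remove tests) by a simpler greedy forward selection that stops at a fixed number of terms. The function names `r_squared`, `forward_select`, and `chance_r2` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def r_squared(X, y):
    """r^2 of an ordinary least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def forward_select(X, y, max_terms):
    """Greedy forward selection (an assumed stand-in for stepwise regression).

    At each step, add the not-yet-chosen column that gives the highest r^2.
    Returns the chosen column indices and the r^2 of the final equation.
    """
    chosen, best_r2 = [], 0.0
    for _ in range(max_terms):
        candidates = [j for j in range(X.shape[1]) if j not in chosen]
        scores = [r_squared(X[:, chosen + [j]], y) for j in candidates]
        best = int(np.argmax(scores))
        chosen.append(candidates[best])
        best_r2 = scores[best]
    return chosen, best_r2


def chance_r2(n_obs, n_screened, max_terms, n_trials=200, seed=0):
    """r^2 values obtained purely by chance over repeated random data sets.

    Every trial draws unrelated random predictors and a random response,
    so any correlation found is fortuitous by construction.
    """
    rng = np.random.default_rng(seed)
    out = np.empty(n_trials)
    for t in range(n_trials):
        X = rng.standard_normal((n_obs, n_screened))  # screened variables
        y = rng.standard_normal(n_obs)                # simulated activity
        _, out[t] = forward_select(X, y, max_terms)
    return out
```

For example, comparing `chance_r2(15, 3, 3).mean()` with `chance_r2(15, 30, 3).mean()` contrasts two cases that both end with three variables in the equation but differ in how many were screened, which is the distinction the abstract emphasizes. The fraction of trials above a chosen cutoff, such as `(chance_r2(15, 30, 3) >= 0.7).mean()`, gives a rough estimate of the risk of a chance correlation under these assumptions.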