Horvath Dragos, Bonachera Fanny, Solov'ev Vitaly, Gaudin Cédric, Varnek Alexander
UGSF-UMR 8576 CNRS/USTL, Université de Lille 1, Bât C9., 59650 Villeneuve d'Ascq, France.
J Chem Inf Model. 2007 May-Jun;47(3):927-39. doi: 10.1021/ci600476r. Epub 2007 May 5.
Descriptor selection in QSAR typically relies on a set of upfront working hypotheses to reduce the initial descriptor set to a tractable size. Stepwise regression, computationally cheap and therefore widely used in spite of its potential pitfalls, is the most aggressive in reducing the effectively explored problem space, because it adopts a greedy variable pick strategy. This work explores an antipodal approach, embodied by an original Genetic Algorithm (GA)-based Stochastic QSAR Sampler (SQS) that favors unbiased model search over computational cost. Independent of a priori descriptor filtering and, most importantly, not limited to linear models, SQS was benchmarked against the ISIDA Stepwise Regression (SR) tool. SQS was run under various premises, varying the training/validation set splitting scheme, the nonlinearity policy, and the descriptors used. With the three anti-HIV compound sets considered, repeated SQS runs sometimes generate poorly overlapping but nevertheless equally well-validating model sets. Enabling SQS to apply nonlinear descriptor transformations enlarges the problem space; nevertheless, nonlinear models tend to be more robust validators. Model validation benchmarking showed SQS to match the performance of SR, or to outperform it in cases where the upfront simplifications of SR "backfire", even though the robust SR got trapped in local minima only once in six cases. Consensus models from large SQS model sets validate well, but not outstandingly better than SR consensus equations. SQS is thus a robust QSAR building tool according to standard validation tests against external sets of compounds (of the same families as used for training), but many of its benefits and drawbacks may not yet be revealed by such tests.
SQS results are a challenge to the traditional way of interpreting and exploiting QSAR: how should one deal with thousands of well-validating models that nonetheless provide potentially diverging applicability ranges and predicted values for external compounds? SR does not impose such a burden on the user, but is "betting" on a single equation or a narrow consensus model to behave properly in virtual screening a sound strategy? By posing these questions, this article will hopefully act as an incentive for the long-haul studies needed to get them answered.
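The "greedy variable pick strategy" attributed to stepwise regression above can be illustrated with a minimal sketch of forward selection. This is not the ISIDA SR tool or the authors' code: the function names (`solve`, `rss`, `forward_stepwise`), the synthetic data, and the choice of residual sum of squares as the selection criterion are all assumptions made for illustration. At each step the sketch keeps the descriptors already chosen and adds the single column that most improves the fit, so it never revisits earlier picks. This is how such a scheme narrows the explored model space.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def rss(X, y, cols):
    """Residual sum of squares of a least-squares fit (with intercept) on `cols`."""
    rows = [[1.0] + [row[j] for j in cols] for row in X]
    k = len(cols) + 1
    AtA = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    Aty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    coef = solve(AtA, Aty)  # normal equations; adequate for this small toy problem
    return sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2
               for r, yi in zip(rows, y))

def forward_stepwise(X, y, max_terms):
    """Greedy forward selection: add the column that lowers the RSS the most."""
    p = len(X[0])
    selected = []
    for _ in range(min(max_terms, p)):
        candidates = [j for j in range(p) if j not in selected]
        best = min(candidates, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best)  # earlier picks are never reconsidered
    return selected

# Hypothetical data: 8 random "descriptors", activity depends on columns 1 and 4.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(60)]
y = [2.0 * row[1] - 1.5 * row[4] + 0.1 * rng.gauss(0.0, 1.0) for row in X]

picked = forward_stepwise(X, y, max_terms=2)
print(sorted(picked))
```

On this toy set the two informative columns are easy to recover. The concern raised in the abstract applies to harder cases, where an early greedy pick can exclude combinations that a stochastic search would still reach.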