Guha Rajarshi, Dutta Debojyoti, Jurs Peter C, Chen Ting
Department of Chemistry, Pennsylvania State University, University Park, Pennsylvania 16802, USA.
J Chem Inf Model. 2006 Jul-Aug;46(4):1836-47. doi: 10.1021/ci060064e.
Traditional quantitative structure-activity relationship (QSAR) models aim to capture the global structure-activity trends present in a data set. In many situations, however, groups of molecules exhibit a specific set of features that relates to their activity or inactivity. Such a set of features can be said to represent a local structure-activity relationship, which traditional QSAR models may not recognize. In this work, we investigate local lazy regression (LLR), which obtains a prediction for a query molecule from its local neighborhood rather than from the whole data set. This modeling approach is especially useful for very large data sets because no a priori model need be built. We applied the technique to three biological data sets. For the first data set, the root-mean-square error (RMSE) for an external prediction set was 0.94 log units for LLR versus 0.92 log units for the global model. However, LLR characterized a specific group of anomalous molecules more accurately (0.64 log units versus 0.70 log units for the global model). For the second data set, LLR decreased the RMSE for the external prediction set from 0.36 log units to 0.31 log units. For the third data set, we obtained an RMSE of 2.01 log units with LLR versus 2.16 log units for the global model. In all cases, LLR predicted a few observations poorly compared to the global model. We analyze why this was observed and discuss possible improvements to the local regression approach.
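The idea of predicting a query molecule from its local neighborhood can be illustrated with a minimal sketch. This is not the authors' implementation: the Euclidean distance in descriptor space, the neighborhood size `k`, the ordinary least-squares fit, and the function names `llr_predict` and `rmse` are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def llr_predict(X_train, y_train, x_query, k=10):
    """Predict one query point from a linear model fit only on its k nearest neighbors.

    Assumptions (not from the abstract): Euclidean distance, ordinary least squares.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)

    # Distance from the query to every training observation in descriptor space.
    dists = np.linalg.norm(X_train - x_query, axis=1)

    # Indices of the k closest training observations (the local neighborhood).
    idx = np.argsort(dists)[:k]

    # Fit a linear model with an intercept on the neighborhood only.
    # No model is built ahead of time: the fit happens per query ("lazy").
    A = np.hstack([np.ones((len(idx), 1)), X_train[idx]])
    coef, *_ = np.linalg.lstsq(A, y_train[idx], rcond=None)

    # Evaluate the local model at the query point.
    return float(coef[0] + x_query @ coef[1:])


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

Because each prediction requires its own neighbor search and fit, the cost moves from training time to query time. A small neighborhood can also make the local fit unstable, which is one plausible reason a few observations might be predicted poorly by such a scheme.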