Balfer Jenny, Bajorath Jürgen
Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr. 2, D-53113, Bonn, Germany.
PLoS One. 2015 Mar 5;10(3):e0119301. doi: 10.1371/journal.pone.0119301. eCollection 2015.
Support vector machines are a popular machine learning method for many classification tasks in biology and chemistry. In addition, the support vector regression (SVR) variant is widely used for numerical property predictions. In chemoinformatics and pharmaceutical research, SVR has probably become the most popular approach for modeling non-linear structure-activity relationships (SARs) and predicting compound potency values. Herein, we have systematically generated and analyzed SVR prediction models for a variety of compound data sets with different SAR characteristics. Although these SVR models were accurate on the basis of global prediction statistics and not prone to overfitting, they were found to consistently mispredict highly potent compounds. Hence, in regions of local SAR discontinuity, SVR prediction models displayed clear limitations. Compared to the observed activity landscapes of the compound data sets, landscapes generated on the basis of SVR potency predictions were partly flattened, and activity cliff information was lost. Taken together, these findings have implications for practical SVR applications. In particular, prospective SVR-based potency predictions should be considered with caution because artificially low predictions are very likely for highly potent candidate compounds, which are the most important prediction targets.
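The contrast drawn in the abstract between global prediction statistics and errors on highly potent compounds can be checked with a simple stratified error calculation. The following is a minimal sketch, not the evaluation protocol of the study: the helper names (`rmse`, `mean_signed_error`, `potent_subset`), the potency threshold of 9.0, and the potency values are all assumptions invented for illustration.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean squared error over all compounds."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sqrt(sum(squared) / len(squared))


def mean_signed_error(observed, predicted):
    """Mean of (predicted - observed); negative values mean underprediction."""
    diffs = [p - o for o, p in zip(observed, predicted)]
    return sum(diffs) / len(diffs)


def potent_subset(observed, predicted, threshold):
    """Keep only compounds whose observed potency is at or above the threshold."""
    pairs = [(o, p) for o, p in zip(observed, predicted) if o >= threshold]
    return [o for o, _ in pairs], [p for _, p in pairs]


# Toy potency values (logarithmic scale) invented for illustration only;
# these are NOT data or results from the study.
observed = [5.0, 5.5, 6.0, 6.2, 6.5, 7.0, 9.5, 10.0]
predicted = [5.2, 5.4, 6.1, 6.0, 6.6, 6.9, 7.8, 8.1]

# Global statistic computed over the whole data set.
global_rmse = rmse(observed, predicted)

# Signed error restricted to the most potent compounds (assumed threshold).
top_obs, top_pred = potent_subset(observed, predicted, threshold=9.0)
top_bias = mean_signed_error(top_obs, top_pred)

print(f"global RMSE: {global_rmse:.2f}")
print(f"mean signed error, highly potent subset: {top_bias:.2f}")
```

Reporting the signed error for the high-potency subset alongside the global RMSE makes a systematic underprediction visible, which a single global statistic can hide when most compounds have moderate potency.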