Institute of Chemistry and Biochemistry, Freie Universität Berlin, Fabeckstrasse 36A, 14195, Berlin, Germany.
J Comput Aided Mol Des. 2011 Dec;25(12):1121-33. doi: 10.1007/s10822-011-9496-z. Epub 2011 Nov 20.
In silico methods that characterize molecular compounds with respect to pharmacologically relevant properties can accelerate the identification of new drugs and reduce their development costs. Quantitative structure-activity/-property relationship (QSAR/QSPR) models correlate the structure and physico-chemical properties of molecular compounds with a specific functional activity or property under study. Typically, a large number of molecular features are generated for the compounds. In many cases, the number of generated features exceeds the number of compounds with known property values that are available for learning. In such situations machine learning methods tend to overfit the training data, that is, the method adapts to very specific features of the training data that are not characteristic of the property under consideration. This problem can be alleviated by diminishing the influence of unimportant, redundant or even misleading features. A better strategy is to eliminate such features completely. Ideally, a molecular property can be described by a small number of chemically interpretable features. The purpose of the present contribution is to provide a predictive modeling approach that combines feature generation, feature selection, model building and control of overtraining in a single application called DemQSAR. DemQSAR is used to predict human volume of distribution (VD(ss)) and human clearance (CL). To control overtraining, quadratic and linear regularization terms are employed. A recursive feature selection approach is used to reduce the number of descriptors. The prediction performance is as good as the best predictions reported in the recent literature. The example presented here demonstrates that DemQSAR can generate a model that uses very few features while maintaining high predictive power.
A standalone DemQSAR Java application for model building of any user-defined property, as well as a web interface for the prediction of human VD(ss) and CL, is available on the webpage of DemPRED: http://agknapp.chemie.fu-berlin.de/dempred/ .
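The combination described in the abstract, a regression model with quadratic and linear regularization terms wrapped in recursive feature selection, can be illustrated with a minimal sketch. This is not the DemQSAR implementation: it assumes a plain elastic-net linear model fitted by coordinate descent, a backward-elimination loop that drops the feature with the smallest absolute weight at each step, and synthetic toy data in place of molecular descriptors. All function names and parameter values below are hypothetical.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (effect of the linear, L1 term)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, l1=0.01, l2=0.1, sweeps=200):
    """Coordinate descent for the objective
    (1/2n)*||y - Xw||^2 + l1*||w||_1 + (l2/2)*||w||^2  (assumes l2 > 0)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw, and w starts at zero
    for _ in range(sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(c * c for c in col) / n
            # Correlation of feature j with the residual that excludes feature j
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            new_wj = soft_threshold(rho, l1) / (z + l2)
            delta = new_wj - w[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                w[j] = new_wj
    return w


def recursive_feature_elimination(X, y, keep, l1=0.01, l2=0.1):
    """Refit on the active features, drop the weakest one, repeat."""
    active = list(range(len(X[0])))
    while True:
        sub = [[row[j] for j in active] for row in X]
        w = fit_elastic_net(sub, y, l1, l2)
        if len(active) <= keep:
            return active, w
        weakest = min(range(len(active)), key=lambda k: abs(w[k]))
        del active[weakest]


def make_toy_data(n=30, p=8, seed=0):
    """Synthetic stand-in for descriptors: only features 0 and 3 matter."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0.0, 0.1) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    selected, weights = recursive_feature_elimination(X, y, keep=2)
    print("selected feature indices:", selected)
    print("weights:", [round(v, 3) for v in weights])
```

The sketch assumes features on a comparable scale, since ranking by absolute weight is only meaningful then; real descriptors would typically be standardized first, and the regularization strengths and the number of retained features would be chosen by cross-validation rather than fixed by hand.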