Basak Subhash C, Natarajan Ramanathan, Mills Denise, Hawkins Douglas M, Kraker Jessica J
Natural Resources Research Institute, Center for Water and Environment, University of Minnesota-Duluth, 55811, USA.
J Chem Inf Model. 2006 Jan-Feb;46(1):65-77. doi: 10.1021/ci050215y.
Quantitative structure-activity relationship (QSAR) modelers often encounter multicollinearity because so many molecular descriptors can be computed. Descriptors such as atom pairs are also sparse, which adds further complexity. Three predictor-thinning methods, namely a modified Gram-Schmidt algorithm, a marginal soft thresholding algorithm, and LASSO (least absolute shrinkage and selection operator), were used to reduce the number of descriptors prior to developing linear models. Juvenile hormone (JH) activity of 304 compounds on Culex pipiens larvae was taken as the model data set. Predictor trimming was applied to a large, diverse descriptor set comprising 268 global molecular descriptors (topostructural, topochemical, and geometrical), 13 quantum chemical descriptors, and 915 atom pairs (substructural counts), and linear models were then fitted by ridge regression. The data set (N = 304) was split five times into a random calibration sample of size 60/110/160/210/260, with the remaining 244/194/144/94/44 compounds used for validation. LASSO was not found to be very effective in handling a large set of descriptors because the number of predictors it retained could not exceed the number of observations. The results indicated that the modified Gram-Schmidt algorithm could be used to trim the number of predictors in the global molecular descriptor set, where collinearity of the descriptors was the major concern. In contrast, the soft thresholding approach was found to be an effective tool for subset selection from a diverse set of descriptors having both sparsity and multicollinearity, as in the combined set of atom pairs and global molecular descriptors. The final model developed after variable selection was dominated by atom pairs, which indicated the important structural moieties that affect the JH activity of the compounds.
The success of the method reiterates that QSAR or quantitative structure-property relationship (QSPR) models can be developed for a diverse set of compounds using properly parametrized and diverse sets of descriptors, provided that appropriate statistical tools are selected.
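The abstract does not define the marginal soft thresholding algorithm, so the following is only a minimal sketch of one plausible reading: each descriptor's marginal (Pearson) correlation with the response is shrunk toward zero by a fixed threshold, and descriptors whose shrunken correlation is zero are dropped. The function names, the threshold value of 0.3, and the toy data are all hypothetical and are not taken from the paper; the ridge regression step is left out. The sketch also reproduces the calibration/validation split sizes stated above.

```python
import random
from statistics import mean


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within the threshold become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for a constant column (e.g. an all-zero count)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def thin_predictors(columns, response, threshold):
    """Assumed screening rule: keep columns whose soft-thresholded
    marginal correlation with the response is nonzero."""
    return [
        j for j, col in enumerate(columns)
        if soft_threshold(pearson(col, response), threshold) != 0.0
    ]


def calibration_split(n_total, n_calibration, seed=0):
    """Randomly split indices 0..n_total-1 into calibration and validation sets."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    return sorted(indices[:n_calibration]), sorted(indices[n_calibration:])


# Toy data (hypothetical): one informative column, one noise column,
# and one all-zero column standing in for a sparse atom-pair count.
rng = random.Random(1)
n = 304
signal = [rng.gauss(0, 1) for _ in range(n)]
response = [s + rng.gauss(0, 0.1) for s in signal]
columns = [signal, [rng.gauss(0, 1) for _ in range(n)], [0.0] * n]

kept = thin_predictors(columns, response, threshold=0.3)

# Calibration sizes from the abstract; the remainder is used for validation.
split_sizes = [
    (len(cal), len(val))
    for cal, val in (calibration_split(n, k) for k in (60, 110, 160, 210, 260))
]
```

In a fuller workflow, the retained columns of each calibration sample would be passed to a ridge regression fit and the model scored on the matching validation sample.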