Department of Chemistry, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan 45137-66731, Iran.
J Chem Inf Model. 2010 Dec 27;50(12):2055-66. doi: 10.1021/ci100169p. Epub 2010 Nov 11.
This study implements a robust jackknife-based descriptor selection procedure assisted by Gram-Schmidt orthogonalization. The Selwood data set, comprising 31 molecules and 53 descriptors, was used. Both multiple linear regression (MLR) and partial least squares (PLS) regression were applied within the jackknife procedure, and the desired results were obtained with PLS regression on both the autoscaled and the orthogonalized data sets. Gram-Schmidt orthogonalization made all descriptors mutually orthogonal and reduced their number to 30. A reproducible set of descriptors was obtained when PLS-jackknife was applied to the Gram-Schmidt orthogonalized data. A simple statistical t-test was used to determine the significance of the regression coefficients obtained from jackknife resampling. As the sample size was increased, descriptors were introduced into the model one by one according to their information content and were ranked accordingly. The number of validated descriptors was proportional to the sample size in the jackknife. The PLS-jackknife parameters, such as the sample size and the number of latent variables in PLS, and the starting descriptor in Gram-Schmidt orthogonalization were investigated and optimized. Applying PLS-jackknife to the orthogonalized data under the optimized conditions validated five descriptors, with q²(TOT) = 0.693 and R² = 0.811. Compared with previous reports, the obtained results are satisfactory.
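The two ingredients named in the abstract, sequential Gram-Schmidt orthogonalization from a chosen starting descriptor and a t-test on coefficients collected over jackknife subsamples, can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions. The function names `gram_schmidt_order` and `jackknife_t` are hypothetical. The rule for choosing the next descriptor (largest residual norm) is a guess. Ordinary least squares is swapped in for PLS to keep the example short. The t-like statistic is simply the mean coefficient divided by its standard deviation across subsamples. The data are synthetic random numbers, not the Selwood set. One point does not depend on these choices: after column centering, 31 samples span at most 30 independent directions, which is consistent with the reduction to 30 descriptors reported above.

```python
import numpy as np


def gram_schmidt_order(X, start=0, tol=1e-8):
    """Orthogonalize descriptor columns one at a time, beginning with `start`.

    Returns the indices of the retained columns (in selection order) and a
    matrix whose orthonormal columns are built from them.
    Assumes the starting column is not constant.
    """
    R = X - X.mean(axis=0)                    # center each descriptor
    scale = np.linalg.norm(R, axis=0).max()   # reference size for the tolerance
    chosen, basis = [], []
    j = start
    while True:
        q = R[:, j] / np.linalg.norm(R[:, j])
        chosen.append(j)
        basis.append(q)
        R = R - np.outer(q, q @ R)            # remove direction q from every column
        norms = np.linalg.norm(R, axis=0)
        norms[chosen] = 0.0                   # never pick a column twice
        j = int(np.argmax(norms))             # assumption: next = largest residual
        if norms[j] <= tol * scale:           # nothing independent is left
            break
    return chosen, np.column_stack(basis)


def jackknife_t(Q, y, n_sub, n_rep=200, seed=0):
    """t-like statistic per coefficient over random subsamples of size n_sub.

    OLS is used here as a stand-in for the PLS regression of the abstract.
    """
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    coefs = np.empty((n_rep, Q.shape[1]))
    for r in range(n_rep):
        idx = rng.choice(n, size=n_sub, replace=False)
        A = np.column_stack([np.ones(n_sub), Q[idx]])   # intercept + descriptors
        b, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
        coefs[r] = b[1:]                                # drop the intercept
    return coefs.mean(axis=0) / coefs.std(axis=0, ddof=1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(31, 53))             # synthetic stand-in, NOT Selwood data
    chosen, Q = gram_schmidt_order(X, start=0)
    y = 3.0 * Q[:, 0] + 0.05 * rng.normal(size=31)
    t = jackknife_t(Q[:, :5], y, n_sub=25)
    print(len(chosen), np.round(t, 1))
```

In this sketch a descriptor would count as validated when the magnitude of its statistic exceeds a chosen critical value. Changing `start` or `n_sub` mimics the optimization of the starting descriptor and the jackknife sample size described in the abstract.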