Liu Shu-Shen, Yin Chun-Sheng, Wang Lian-Sheng
State Key Laboratory of Pollution Control and Resources Reuse, Department of Environmental Science & Engineering, Nanjing University, Nanjing 210093, People's Republic of China.
J Chem Inf Comput Sci. 2002 May-Jun;42(3):749-56. doi: 10.1021/ci010245a.
MEDV-13, the molecular electronegativity distance vector based on 13 atomic types, has at most 91 descriptors. With so many descriptors, multiple linear regression (MLR) cannot be used directly to derive a quantitative structure-activity relationship (QSAR) model. Although principal component regression (PCR) or partial least-squares regression (PLSR) can be employed to develop a latent QSAR model, it remains difficult to determine the number of principal components (PCs) and to describe their physical meaning. Therefore, a genetic algorithm (GA) is first employed to select an optimal subset of descriptors from the original MEDV-13 descriptor set. MLR is then used to build a QSAR model between the optimal subset and the biological activities of three sets of compounds. For 31 benchmark steroids, a 5-descriptor QSAR model (M1) is developed between the corticosteroid-binding globulin (CBG) binding affinity of the steroids and a 5-descriptor subset. The root-mean-square error of estimation (RMSEE) and the correlation coefficient of estimation (r) between the observed CBG binding affinity (BA) and the BA estimated by M1 are 0.422 and 0.9182, respectively. The root-mean-square error of prediction (RMSEP) and the correlation coefficient of prediction (q) between the observed BA and the BA predicted by leave-one-out cross-validation are 0.504 and 0.8818, respectively. For 58 dipeptides inhibiting angiotensin-converting enzyme (ACE), a 5-variable QSAR model (M2) is derived between the pIC50 of the peptides and a 5-descriptor subset. M2 is of high quality, with RMSEE = 0.339, r = 0.9398, RMSEP = 0.370, and q = 0.9280. For 16 indomethacin amides and esters (ImAE) inhibiting cyclooxygenase-2 (COX-2), a 6-variable QSAR model (M3) is built, with RMSEE = 0.079, r = 0.9839, RMSEP = 0.151, and q = 0.9413.
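The workflow described above (GA subset selection, then MLR, then leave-one-out validation) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The GA operators, population size, mutation rate, and the use of RMSEP as the fitness function are all assumptions made for illustration, as is the choice to divide by the number of compounds when computing RMSEE and RMSEP. The function names `loo_stats` and `ga_select` are hypothetical, and the data are synthetic random numbers shaped like the steroid set (31 compounds, 91 descriptors), not the paper's data.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    return coef[0] + X @ coef[1:]

def loo_stats(X, y):
    """Return (RMSEE, r, RMSEP, q) for an MLR model on descriptor matrix X.

    Assumption: both RMS errors divide by n, the number of compounds."""
    n = len(y)
    est = predict_mlr(fit_mlr(X, y), X)          # fitted (estimated) values
    pred = np.empty(n)
    for i in range(n):                           # leave-one-out predictions
        keep = np.arange(n) != i
        pred[i] = predict_mlr(fit_mlr(X[keep], y[keep]), X[i:i + 1])[0]
    rmsee = np.sqrt(np.mean((y - est) ** 2))
    rmsep = np.sqrt(np.mean((y - pred) ** 2))
    r = np.corrcoef(y, est)[0, 1]
    q = np.corrcoef(y, pred)[0, 1]
    return rmsee, r, rmsep, q

def ga_select(X, y, k, pop_size=30, generations=30, p_mut=0.2, seed=0):
    """Toy GA over k-descriptor subsets; fitness is LOO RMSEP (an assumption)."""
    rng = np.random.default_rng(seed)
    n_desc = X.shape[1]
    cache = {}

    def fitness(subset):
        key = tuple(subset.tolist())
        if key not in cache:
            cache[key] = loo_stats(X[:, subset], y)[2]
        return cache[key]

    pop = [np.sort(rng.choice(n_desc, size=k, replace=False))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)                    # lower RMSEP first
        survivors = pop[:pop_size // 2]          # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.choice(len(survivors), size=2, replace=False)
            # Crossover: draw k descriptors from the union of both parents.
            pool = np.union1d(survivors[a], survivors[b])
            child = rng.choice(pool, size=k, replace=False)
            if rng.random() < p_mut:
                # Mutation: swap one descriptor for one not in the child.
                unused = np.setdiff1d(np.arange(n_desc), child)
                child[rng.integers(k)] = rng.choice(unused)
            children.append(np.sort(child))
        pop = survivors + children
    best = min(pop, key=fitness)
    return best, loo_stats(X[:, best], y)

# Synthetic stand-in data: 31 "compounds" x 91 "descriptors".
rng = np.random.default_rng(1)
X = rng.normal(size=(31, 91))
y = X[:, [3, 17, 42]] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=31)

subset, stats = ga_select(X, y, k=5)
print("selected descriptors:", subset)
print("RMSEE=%.3f r=%.4f RMSEP=%.3f q=%.4f" % stats)
```

Because the leave-one-out residuals of a least-squares fit are never smaller in magnitude than the fitted residuals, RMSEP is at least as large as RMSEE. This is consistent with the pattern in all three reported models.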