Gramatica Paola, Giani Elisa, Papa Ester
Department of Structural and Functional Biology, QSAR Research Unit in Environmental Chemistry and Ecotoxicology, University of Insubria, via Dunant 3, 21100 Varese, Italy.
J Mol Graph Model. 2007 Mar;25(6):755-66. doi: 10.1016/j.jmgm.2006.06.005. Epub 2006 Aug 4.
The soil sorption partition coefficient (log K(oc)) of a heterogeneous set of 643 non-ionic organic compounds, spanning a range of more than 6 log units, is predicted by a statistically validated QSAR modeling approach. The applied multiple linear regression (ordinary least squares, OLS) is based on a variety of theoretical molecular descriptors selected by the genetic algorithm-variable subset selection (GA-VSS) procedure. The models were validated for predictivity by different internal and external validation approaches. For external validation we applied self-organizing maps (SOM) to split the original data set: the best four-dimensional model, developed on a reduced training set of 93 chemicals, has a predictivity of 78% when applied to the 550 validation chemicals (prediction set). The selected molecular descriptors, which can be interpreted through their mechanistic meaning, were compared with the more common physico-chemical descriptors log K(ow) and log S(w). The chemical applicability domain of each model was verified by the leverage approach, so that only reliable predictions are proposed. The best predictions were obtained by consensus modeling from 10 different models in the genetic algorithm model population.
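The leverage check and the consensus step described above can be illustrated with a short sketch. It is not the authors' code: it assumes NumPy, a plain OLS fit with an intercept, the warning leverage h* = 3(p + 1)/n that is commonly used in the QSAR literature (the abstract does not state the cutoff), and consensus taken as the arithmetic mean of the individual model predictions. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def _with_intercept(X):
    """Prepend a column of ones so the model includes an intercept."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(len(X)), X])


def fit_ols(X, y):
    """Ordinary least squares fit; returns coefficients, intercept first."""
    coef, *_ = np.linalg.lstsq(_with_intercept(X), np.asarray(y, float), rcond=None)
    return coef


def predict(coef, X):
    """Predicted response for descriptor matrix X."""
    return _with_intercept(X) @ coef


def leverages(X_train, X_query):
    """Leverage h_i = x_i^T (A^T A)^-1 x_i of each query compound,
    where A is the training descriptor matrix with intercept column."""
    A = _with_intercept(X_train)
    Q = _with_intercept(X_query)
    core = np.linalg.pinv(A.T @ A)
    return np.einsum("ij,jk,ik->i", Q, core, Q)


def leverage_cutoff(X_train):
    """Assumed warning leverage h* = 3 (p + 1) / n."""
    n, p = np.asarray(X_train).shape
    return 3.0 * (p + 1) / n


def consensus_predict(models, X):
    """Average the predictions of several models.
    `models` is a list of (descriptor_column_indices, coefficients)."""
    X = np.asarray(X, dtype=float)
    preds = [predict(coef, X[:, cols]) for cols, coef in models]
    return np.mean(preds, axis=0)


# Synthetic stand-in data: 93 training and 550 prediction compounds,
# 6 candidate descriptors, purely for illustration.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(93, 6))
X_pred = rng.normal(size=(550, 6))
y_train = 2.0 + X_train[:, :4] @ np.array([0.8, -0.5, 0.3, 0.1]) \
    + rng.normal(scale=0.2, size=93)

# Two hypothetical four-descriptor models from a model population.
subsets = [[0, 1, 2, 3], [0, 1, 2, 4]]
models = [(cols, fit_ols(X_train[:, cols], y_train)) for cols in subsets]

# Flag prediction compounds inside the applicability domain of the first model.
cols0 = subsets[0]
h = leverages(X_train[:, cols0], X_pred[:, cols0])
in_domain = h <= leverage_cutoff(X_train[:, cols0])

# Consensus predictions, reported only for in-domain compounds.
y_consensus = consensus_predict(models, X_pred)
reliable_predictions = y_consensus[in_domain]
```

A compound whose leverage exceeds the cutoff lies far from the training descriptors, so its prediction is an extrapolation and would be excluded or flagged rather than reported as reliable.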