St. Anthony Falls Laboratory, University of Minnesota, Minneapolis, MN 55414, USA.
Water Res. 2013 Sep 15;47(14):5362-70. doi: 10.1016/j.watres.2013.06.011. Epub 2013 Jun 15.
The aqueous solubility (log S) of xenobiotic chemicals has been identified as a key characteristic in determining their bioaccessibility/bioavailability and their fate and transport in aquatic environments. Here we explore and evaluate the use of a state-of-the-art data analysis technique, Projection to Latent Structures (PLS), to estimate log S of environmentally relevant chemicals. A large number (n = 624) of molecular descriptors were computed for over 1400 organic chemicals and then reduced by a feature selection technique. Candidate predictor descriptors were fitted to the data by means of PLS, which was optimized by internal leave-one-out cross-validation and validated against an external data set. The final (best) PLS model, with only four variables (AlogP, X1sol, Mv, and E), exhibited noteworthy stability and good predictive power. It explained 91% of the variance in the data (n = 1400) with an average absolute error of 0.5 log units, even though the solubilities span more than 12 orders of magnitude. The newly proposed model is transparent, easily portable from one user to another, and robust enough to accurately estimate log S of a wide range of emerging contaminants.
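The workflow summarized above (fit a single-response PLS model, choose its complexity by leave-one-out cross-validation, report explained variance and average absolute error) can be illustrated with a minimal sketch. This is not the authors' implementation: it is a plain-Python NIPALS PLS1 written for illustration, the data are synthetic, the four columns are only stand-ins for four descriptors, the coefficients in `true_coef` are hypothetical, and the data are mean-centred but not scaled.

```python
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def fit_pls1(X, y, n_components):
    """Fit single-response PLS (NIPALS) on mean-centred data."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # X residuals
    f = [v - y_mean for v in y]                                # y residuals
    components = []
    for _ in range(n_components):
        # Weight vector: direction in X most covariant with the y residual.
        w = [dot([E[i][j] for i in range(n)], f) for j in range(m)]
        norm = dot(w, w) ** 0.5
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]  # scores
        tt = dot(t, t)
        p = [dot([E[i][j] for i in range(n)], t) / tt for j in range(m)]
        q = dot(f, t) / tt
        # Deflate X and y before extracting the next latent variable.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))
    return x_mean, y_mean, components


def predict_pls1(model, x):
    """Predict one sample by repeating the deflation steps on it."""
    x_mean, y_mean, components = model
    e = [a - b for a, b in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in components:
        t = dot(e, w)
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, p)]
    return y_hat


def loo_predictions(X, y, n_components):
    """Leave-one-out: refit without sample i, then predict sample i."""
    preds = []
    for i in range(len(X)):
        model = fit_pls1(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_components)
        preds.append(predict_pls1(model, X[i]))
    return preds


def q2_and_mae(y, preds):
    """Cross-validated explained variance (Q^2) and mean absolute error."""
    y_mean = sum(y) / len(y)
    press = sum((a - b) ** 2 for a, b in zip(y, preds))
    ss_tot = sum((a - y_mean) ** 2 for a in y)
    mae = sum(abs(a - b) for a, b in zip(y, preds)) / len(y)
    return 1.0 - press / ss_tot, mae


# Synthetic stand-in data (NOT the paper's data set or coefficients).
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(60)]
true_coef = [-1.0, 0.5, -0.3, 0.8]  # hypothetical
y = [dot(row, true_coef) + rng.gauss(0, 0.3) for row in X]

# Pick the number of latent variables with the highest LOO Q^2.
results = {a: q2_and_mae(y, loo_predictions(X, y, a)) for a in range(1, 5)}
best = max(results, key=lambda a: results[a][0])
print(f"components={best}  Q2={results[best][0]:.3f}  MAE={results[best][1]:.3f}")
```

Predicting each held-out sample by deflation avoids an explicit matrix inversion for the regression coefficients. In practice an established library implementation, together with descriptor scaling and a separate external validation set as described in the abstract, would normally be preferred.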