Hemmateenejad Bahram, Shamsipur Mojtaba, Miri Ramin, Elyasi Maryam, Foroghinia Farzaneh, Sharghi Hashem
Medicinal & Natural Product Chemistry Research Center, Shiraz University of Medical Sciences, Shiraz, Iran.
Anal Chim Acta. 2008 Mar 3;610(1):25-34. doi: 10.1016/j.aca.2008.01.011. Epub 2008 Jan 15.
A quantitative structure-property relationship (QSPR) study was conducted on the solubility in supercritical fluid carbon dioxide (SCF-CO2) of some recently synthesized anthraquinone, anthrone and xanthone derivatives. The data set consisted of 29 molecules measured at various temperatures and pressures, giving 1190 solubility data points. The combined data splitting-feature selection (CDFS) strategy, which was previously developed in our research group, was used for descriptor selection and model development. The relationship between the selected molecular descriptors and the solubility data was modeled by linear (multiple linear regression; MLR) and nonlinear (artificial neural network; ANN) methods. The QSPR models were validated by cross-validation and by using them to predict the solubility of three external-set compounds, which did not contribute to the model development steps. Both the linear and nonlinear methods gave accurate predictions, with the ANN model being the more accurate. The root mean square errors of prediction obtained by the MLR and ANN models were 0.284 and 0.095, respectively, in terms of the logarithm of g solute m⁻³ of SCF-CO2. The models selected by the CDFS method were compared with those selected by the conventional stepwise feature selection method. The latter produced models with a higher number of descriptors and lower prediction ability, and can therefore be considered over-fitted.
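As an illustration of the error metric quoted above, the following minimal sketch computes a root mean square error of prediction on log-transformed solubilities. It assumes base-10 logarithms (the abstract does not state the base), and the function name `rmsep` and the sample values are hypothetical; none of this is taken from the paper.

```python
import math


def rmsep(log_observed, log_predicted):
    """Root mean square error of prediction between two equal-length sequences."""
    if not log_observed or len(log_observed) != len(log_predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(log_observed, log_predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical solubilities in g solute per m^3 of SCF-CO2 (not data from the study).
observed_g_m3 = [1.0, 10.0, 100.0]
predicted_g_m3 = [1.0, 10.0, 1000.0]

# The error is evaluated on the logarithm of solubility (base 10 assumed here).
log_obs = [math.log10(s) for s in observed_g_m3]
log_pred = [math.log10(s) for s in predicted_g_m3]

error = rmsep(log_obs, log_pred)
```

If the logarithm is indeed base 10, an error of 0.095 log units corresponds to a typical multiplicative deviation of about 10^0.095 ≈ 1.24 in solubility, and 0.284 to about 10^0.284 ≈ 1.92.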