Lei Bin, Zang Yunlei, Xue Zhiwei, Ge Yiqing, Li Wei, Zhai Qian, Jiao Long
College of Chemistry and Chemical Engineering, Xi'an Shiyou University, Xi'an 710065, China.
No. 203 Research Institute of Nuclear Industry, Xianyang 712000, China.
Se Pu. 2021 Mar;39(3):331-337. doi: 10.3724/SP.J.1123.2020.06011.
The chromatographic retention index (RI) is an important parameter for describing the retention behavior of substances in chromatographic analysis. Experimentally determining the RI values of different aldehyde and ketone compounds on a range of polar stationary phases is expensive and time-consuming. Quantitative structure-activity relationship (QSAR) modeling is an important chemometric technique that has been widely used to correlate the properties of chemicals with their molecular structures. Irrespective of whether the properties of a molecule have been experimentally determined, they can be calculated using QSAR models. It is therefore necessary and advisable to establish a QSAR model for predicting the RI values of aldehydes and ketones. Hologram QSAR (HQSAR) is a highly efficient QSAR approach that can easily generate QSAR models with good statistics and high prediction accuracy. The HQSAR approach introduces a specific fragment fingerprint, known as a molecular hologram, and uses it as the structural descriptor for building the QSAR model. In general, individual HQSAR models are built in QSAR research. However, individual QSAR models are often affected by underfitting and overfitting. Ensemble modeling, which integrates several individual models through a consensus strategy, can overcome the shortcomings of individual models. It is worth studying whether ensemble modeling can improve the prediction ability of the HQSAR method in order to build more accurate and reliable QSAR models. Therefore, this study investigates QSAR models for the chromatographic RI of aldehydes and ketones using ensemble modeling and the HQSAR method. Two individual HQSAR models, one for each of two stationary phases (DB-210 and HP-Innowax), were established from 34 compounds. The prediction ability of the two established models was assessed by external test set validation and leave-one-out cross validation (LOO-CV).
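The consensus idea and the LOO-CV statistic described above can be illustrated with a minimal sketch. This is not the authors' implementation: HQSAR fits its models by partial least squares on molecular holograms, whereas the sketch substitutes ordinary least squares on synthetic descriptors, and the descriptor subsets, the mean-based consensus rule, and all function names are assumptions made for illustration only.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    """Apply a fitted coefficient vector to new descriptor rows."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def loo_predictions(X, y):
    """Predict each sample from a model trained on all the other samples."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        coef = fit_ols(X[mask], y[mask])
        preds[i] = predict_ols(coef, X[i:i + 1])[0]
    return preds

def q2(y, preds):
    """Cross-validated squared correlation coefficient: 1 - PRESS / SS_total."""
    press = np.sum((y - preds) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_total

# Synthetic stand-in data (NOT the paper's compounds or RI values).
rng = np.random.default_rng(0)
X = rng.normal(size=(26, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.3, size=26)

# Each individual model sees a different descriptor subset (hypothetical
# stand-in for models built with different hologram settings).
subsets = [[0, 1, 2, 3], [2, 3, 4, 5], [0, 1, 4, 5]]
individual = [loo_predictions(X[:, cols], y) for cols in subsets]

# Consensus strategy assumed here: the plain mean of individual predictions.
consensus = np.mean(individual, axis=0)

for cols, preds in zip(subsets, individual):
    print("descriptors", cols, "q2 =", round(q2(y, preds), 3))
print("consensus q2 =", round(q2(y, consensus), 3))
```

Averaging is only one possible consensus rule; weighted or median-based combinations follow the same pattern, and whether the consensus outperforms every individual model depends on the data.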
The 34 investigated compounds were randomly assigned to two groups. Group Ⅰ comprised 26 compounds, and Group Ⅱ comprised 8 compounds. In the external test set validation, Group Ⅰ was used to manually optimize the two fragment parameters, fragment distinction (FD) and fragment size (FS), and to build the HQSAR models. Group Ⅱ was used as the test set to assess the predictive performance of the developed models. For the DB-210 stationary phase, the optimal individual HQSAR model was obtained by setting the FD and FS to "donor/acceptor atoms (DA)" and 1-9, respectively. For the HP-Innowax stationary phase, the optimal individual HQSAR model was obtained by setting the FD and FS to "DA" and 4-7, respectively. The squared correlation coefficient of cross validation ([Formula: see text]) for predicting the RI values of the DB-210 and HP-Innowax stationary phases were 0.927 and 0.919, 0.956 and 0.979, 0.929 and 0.963, 0.927 and 0.958, and 0.935 and 0.963, respectively. Compared with the individual HQSAR models, the established ensemble HQSAR models show better robustness and accuracy, indicating that ensemble modeling is an effective approach. The combination of HQSAR and ensemble modeling is a practicable and promising method for studying and predicting the RI values of aldehydes and ketones.
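To make the role of the fragment size (FS) parameter concrete, the following toy sketch builds a hashed fragment-count vector in the spirit of a molecular hologram. It is a simplified illustration under stated assumptions, not the HQSAR software's algorithm: only linear fragments are enumerated, bond orders and the fragment distinction (FD) flags such as donor/acceptor atoms are ignored, and the hologram length of 53 bins, the use of CRC32 via `zlib.crc32`, and the function names are choices made for this example.

```python
import zlib
import numpy as np

def linear_fragments(atoms, bonds, fs_min, fs_max):
    """Enumerate linear fragments containing fs_min..fs_max atoms."""
    neighbors = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)

    fragments = []

    def extend(path):
        if fs_min <= len(path) <= fs_max and path[0] <= path[-1]:
            # A path and its reverse are the same fragment, so only the
            # direction with the smaller starting index is recorded.
            labels = [atoms[i] for i in path]
            fragments.append("-".join(min(labels, labels[::-1])))
        if len(path) < fs_max:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return fragments

def hologram(atoms, bonds, fs_min, fs_max, length=53):
    """Hash each fragment into one of `length` bins and count occurrences."""
    bins = np.zeros(length, dtype=int)
    for frag in linear_fragments(atoms, bonds, fs_min, fs_max):
        bins[zlib.crc32(frag.encode()) % length] += 1
    return bins

# Heavy-atom skeleton of propanal: C-C-C-O (bond orders are ignored here).
atoms = ["C", "C", "C", "O"]
bonds = [(0, 1), (1, 2), (2, 3)]

# With FS = 1-3 there are 9 fragments: 4 atoms, 3 bonded pairs,
# and 2 three-atom chains.
print(linear_fragments(atoms, bonds, 1, 3))
print(hologram(atoms, bonds, 1, 3))
```

Widening the FS range adds longer fragments to the count vector, which is why FS is tuned per stationary phase: the descriptor changes, and so does the model fitted to it.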