Quantitative structure-retention relationship for the Kovats retention indices of a large set of terpenes: a combined data splitting-feature selection strategy.

Author information

Hemmateenejad Bahram, Javadnia Katayoun, Elyasi Maryam

Affiliation

Chemistry Department, Shiraz University, Shiraz 71454, Iran.

Publication information

Anal Chim Acta. 2007 May 29;592(1):72-81. doi: 10.1016/j.aca.2007.04.009. Epub 2007 Apr 8.

Abstract

A data set consisting of a large number of terpenoids, compounds that are widely distributed in nature and abundant in higher plants, has been used to develop a quantitative structure-property relationship (QSPR) for their Kovats retention indices. QSPR models are usually obtained by splitting the data into two sets: calibration (or training) and prediction (or validation). All model-building steps, especially the feature selection procedure, are performed using this initial split, and the performance of the resulting models therefore depends strongly on it. To investigate the effect of data splitting on feature selection, this article proposes a combined data splitting-feature selection (CDFS) methodology for QSPR model development, which produces several different training/validation/test sets and repeats all of the model-building steps for each. In this method, the data are split many times, and feature selection is performed for each split. The resulting models are compared for similarities and differences between their selected descriptors. The final model is the one whose descriptors are the variables common to all of the resulting models. The method was applied to a QSPR study of a large data set containing the Kovats retention indices of 573 terpenoids. A final 8-parameter multilinear model with constitutional and topological indices was obtained. Cross-validation indicated that the model could reproduce more than 90% of the variance in the Kovats retention data. The relative error of prediction for an external test set of 50 compounds was 3.2%. Finally, to improve the results, the structure-retention relationships were also modeled with a nonlinear approach using artificial neural networks, which gave better results.
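The repeated split-and-select idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which feature selection algorithm, split ratio, or number of splits was used, so the greedy forward selection, the 70/30 random split, the function names (`fit_predict`, `forward_select`, `cdfs_common_descriptors`), and the synthetic data below are all assumptions made for illustration only.

```python
import numpy as np


def fit_predict(X_train, y_train, X_eval):
    """Fit ordinary least squares with an intercept, then predict for X_eval."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_eval)), X_eval]) @ coef


def forward_select(X_train, y_train, X_val, y_val, n_features):
    """Greedy forward selection (an assumed stand-in for the paper's method):
    at each step, add the descriptor that gives the lowest validation MSE."""
    selected = []
    remaining = list(range(X_train.shape[1]))
    for _ in range(n_features):
        errors = []
        for j in remaining:
            cols = selected + [j]
            pred = fit_predict(X_train[:, cols], y_train, X_val[:, cols])
            errors.append(np.mean((y_val - pred) ** 2))
        best = remaining[int(np.argmin(errors))]
        selected.append(best)
        remaining.remove(best)
    return selected


def cdfs_common_descriptors(X, y, n_splits=20, n_features=3,
                            train_frac=0.7, seed=0):
    """Split the data many times, run feature selection on each split,
    and return the descriptor indices selected in every split."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(train_frac * n)
    common = None
    for _ in range(n_splits):
        perm = rng.permutation(n)
        tr, va = perm[:n_train], perm[n_train:]
        chosen = set(forward_select(X[tr], y[tr], X[va], y[va], n_features))
        common = chosen if common is None else common & chosen
    return sorted(common)


# Hypothetical synthetic data: 200 "compounds", 10 "descriptors",
# of which only columns 0 and 3 actually influence the response.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = 5.0 * X[:, 0] - 4.0 * X[:, 3] + rng.normal(scale=0.1, size=200)
print(cdfs_common_descriptors(X, y))
```

In this toy setting, the descriptors that truly drive the response are selected in every split, while a third descriptor picked by chance tends to differ from split to split and so drops out of the intersection. A real application would hold out an external test set before any splitting, as the abstract describes, and would use the actual molecular descriptors instead of random numbers.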
