[Construction of a machine learning ensemble prediction model for gas chromatographic retention index on stationary phases with different polarities].

Authors

Wang Qian-Yi, Zhu Yong-le, Li Xue-Hua

Affiliation

Key Laboratory of Industrial Ecology and Environmental Engineering, Ministry of Education, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China.

Publication details

Se Pu. 2025 Apr 8;43(4):355-362. doi: 10.3724/SP.J.1123.2024.07014.

Abstract

Gas chromatography is an analytical technique widely used to separate and identify compounds. The retention index (RI) plays a significant role in gas chromatography because it provides a standardized measure of the retention behavior of a compound under specific conditions and is a powerful compound-identification tool, particularly for complex mixtures. Predicting RI values is therefore a meaningful objective, particularly across stationary phases of different polarity, because RI varies significantly from one polar stationary phase to another. To address this issue, we developed a model for predicting gas-chromatographic RIs on stationary phases of varying polarity by collecting 4183 retention-index values for 2499 compounds on eight types of stationary phase from the literature and databases. The stationary phases were further classified into five categories based on their McReynolds constants: strongly polar, polar, medium polar, weakly polar, and non-polar. This classification allows the model to handle a wide range of polarities, which enhances its versatility and applicability to various analytical scenarios. The predictive model was constructed by integrating two types of feature. The 1D and 2D molecular-structural features of the compounds were first determined; these features capture the chemical and physical properties of the compounds, including their relative molecular masses, functional groups, and topological indices. These descriptors characterize the molecular properties that influence retention behavior. Stationary-phase polarity was then one-hot encoded, which converts the categorical polarity information into a format that machine-learning algorithms can use effectively.
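The feature construction described above can be illustrated with a minimal sketch. It assumes a fixed ordering of the five polarity classes and uses made-up descriptor values; the function names `one_hot_polarity` and `build_feature_vector` are hypothetical and are not taken from the paper.

```python
# Polarity classes named in the abstract; the ordering used here is an assumption.
POLARITY_CLASSES = [
    "strongly polar",
    "polar",
    "medium polar",
    "weakly polar",
    "non-polar",
]


def one_hot_polarity(polarity):
    """Return a 0/1 vector with a single 1 at the index of the given class."""
    if polarity not in POLARITY_CLASSES:
        raise ValueError(f"unknown polarity class: {polarity!r}")
    return [1 if cls == polarity else 0 for cls in POLARITY_CLASSES]


def build_feature_vector(descriptors, polarity):
    """Concatenate molecular descriptors with the one-hot polarity encoding."""
    return list(descriptors) + one_hot_polarity(polarity)


# Hypothetical descriptor values (for example a molecular mass, a
# functional-group count, and a topological index); not data from the paper.
example_descriptors = [142.2, 1.0, 3.7]
features = build_feature_vector(example_descriptors, "medium polar")
print(features)  # [142.2, 1.0, 3.7, 0, 0, 1, 0, 0]
```

With this layout the same compound measured on two stationary phases yields two rows that share the descriptor part and differ only in the polarity part.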
This encoding technique ensures that the model can distinguish among the effects of various polarities on the retention behavior of the compounds. Nine algorithms were used to construct predictive machine-learning models, including linear regression, decision tree, random forest, support vector machine (SVM), k-nearest-neighbor (KNN), gradient-boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and light gradient-boosting machine (LightGBM) algorithms. Voting regression was used to build an optimally performing ensemble learning model based on the XGBoost and LightGBM algorithms. This ensemble model, which combines the strengths of its individual base models, performed well, with a training set coefficient of determination (R²) of 0.99, a training set root mean square error (RMSE) of 101.85, a test set R² of 0.97, and a test set RMSE of 107.44. Williams plots were used to characterize the applicability domain of the model; over 94% of the data lay within the domain, indicative of broad applicability and high predictive confidence. The successful development of this predictive retention-index model represents a significant advancement in the gas-chromatography field. By integrating advanced machine-learning techniques with comprehensive chemical- and physical-property data, the developed model predicts RI values with high accuracy across stationary phases covering a wide range of polarities. The developed ensemble model is more robust and has better predictive ability than the individual machine-learning models. The establishment of this model is of great scientific significance and practical value for improving the efficiency and accuracy of target and non-target gas-chromatographic analyses.
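As a rough illustration of the voting-regression step and the two reported metrics, the sketch below averages the outputs of two base regressors with equal weights and then computes R² and RMSE. The two base models are hypothetical stand-in functions, not fitted XGBoost or LightGBM models, and the numbers are invented; equal weighting is an assumption, since the abstract does not state the weights used. In practice a library class such as scikit-learn's `VotingRegressor` would wrap the fitted models.

```python
import math


def voting_predict(models, x):
    """Average the predictions of several base regressors (equal weights)."""
    predictions = [model(x) for model in models]
    return sum(predictions) / len(predictions)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical stand-ins for two fitted base models (not real boosted trees).
def model_a(x):
    return 100.0 * x + 30.0


def model_b(x):
    return 100.0 * x - 10.0


# Invented inputs and "observed" retention indices, for illustration only.
xs = [8.0, 10.0, 12.0]
observed = [800.0, 1000.0, 1200.0]
predicted = [voting_predict([model_a, model_b], x) for x in xs]

print(predicted)
print(r_squared(observed, predicted), rmse(observed, predicted))
```

Averaging lets opposite-signed errors of the base models partly cancel, which is one reason an ensemble can be more robust than either model alone.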


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/69fe/11966378/a9b72d46209a/img_1.jpg
