Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E9, Canada.
Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E8, Canada.
J Chromatogr A. 2023 Aug 30;1705:464176. doi: 10.1016/j.chroma.2023.464176. Epub 2023 Jun 24.
We describe a freely available web server called Retention Index Predictor (RIpred) (https://ripred.ca) that rapidly and accurately predicts Gas Chromatographic Kováts Retention Indices (RI) using SMILES strings as chemical structure input. RIpred performs RI prediction for three different stationary phases (semi-standard non-polar (SSNP), standard non-polar (SNP), and standard polar (SP)) for both derivatized (trimethylsilyl (TMS) and tert‑butyldimethylsilyl (TBDMS) derivatized) and underivatized (base compound) forms of GC-amenable structures. RIpred was developed to address the need for freely available, fast, highly accurate RI predictions for a wide range of derivatized and underivatized chemicals for all common GC stationary phases. RIpred is based on a Graph Neural Network (GNN) trained on compound structures, their extracted features (mostly atom-level features) and the GC-RI data from the National Institute of Standards and Technology databases (NIST 17 and NIST 20). We curated the NIST 17 and NIST 20 GC-RI data, which are available for all three stationary phases, to create the inputs (molecular graphs in this case) needed to enhance model performance. The performance of different RIpred predictive models was evaluated using 10-fold cross-validation (CV). The best-performing RIpred models were identified and, when tested on hold-out test sets from all stationary phases, achieved a Mean Absolute Error (MAE) of <73 RI units (SSNP: 16.5-29.5, SNP: 38.5-45.9, SP: 46.52-72.53). The Mean Absolute Percentage Error (MAPE) of these models was typically within 3% (SSNP: 0.78-1.62%, SNP: 1.87-2.88%, SP: 2.34-4.05%). When compared to the best-performing model by Qu et al., 2021, RIpred performed similarly (MAE of 16.57 RI units [RIpred] vs. 16.84 RI units [Qu et al., 2021 predictor] for derivatized compounds).
RIpred also includes ∼5 million predicted RI values for all GC-amenable compounds (∼57,000) in the Human Metabolome Database HMDB 5.0 (Wishart et al., 2022).
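The two error metrics reported above, MAE (in RI units) and MAPE (in percent), can be illustrated with a minimal sketch. The function names and the retention index values below are hypothetical and chosen only for illustration; they are not taken from RIpred or from the NIST data, and this is not the authors' evaluation code.

```python
def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted RI, in RI units."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mean_absolute_percentage_error(observed, predicted):
    """Average absolute error relative to the observed RI, expressed in percent."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(o == 0 for o in observed):
        raise ValueError("observed values must be non-zero for MAPE")
    return 100.0 * sum(
        abs(o - p) / abs(o) for o, p in zip(observed, predicted)
    ) / len(observed)


# Made-up example values (NOT real measurements or RIpred outputs).
observed_ri = [1000.0, 1200.0, 1500.0]
predicted_ri = [1010.0, 1180.0, 1530.0]

mae = mean_absolute_error(observed_ri, predicted_ri)
mape = mean_absolute_percentage_error(observed_ri, predicted_ri)
print(f"MAE = {mae:.2f} RI units, MAPE = {mape:.2f}%")
```

Because MAPE divides each error by the observed RI, the same absolute error counts for less on compounds with larger retention indices, which is one reason both metrics are reported side by side.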