National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, USA.
J Chromatogr A. 2021 Jun 7;1646:462100. doi: 10.1016/j.chroma.2021.462100. Epub 2021 Mar 25.
The Kováts retention index is a dimensionless quantity that characterizes how quickly a compound passes through a gas chromatography column. Because it is independent of many experimental variables, it is considered a near-universal descriptor of retention time on a chromatography column. Kováts retention indices have been determined experimentally for a large number of molecules. The "NIST 20: GC Method/Retention Index Library" database has collected and, more importantly, curated the retention indices of a subset of these compounds, resulting in a highly valued reference database. The experimental data in the library form an ideal data set for training machine learning models to predict the retention indices of unknown compounds. In this article, we describe the training of a graph neural network model to predict the Kováts retention index for compounds in the NIST library and compare this approach with previous work [1]. We predict the Kováts retention index with a mean unsigned error of 28 index units, compared with 44 index units for the putative best previous result, which was obtained with a convolutional neural network [1]. The NIST library also incorporates an estimation scheme based on a group contribution approach, which achieves a mean unsigned error of 114 index units relative to the experimental data. Our method uses the same input data source as the group contribution approach, making it straightforward to apply to existing libraries. Our results convincingly demonstrate the predictive power of systematic, data-driven deep learning approaches applied to chemical data; for the data in the NIST 20 library, our model outperforms previous models.
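The three models above are compared by mean unsigned error, that is, the average absolute difference between predicted and experimental retention indices. As a minimal sketch of that metric (the function name `mean_unsigned_error` and the sample values are hypothetical and are not taken from the article or the NIST 20 library), it could be computed as follows:

```python
from typing import Sequence


def mean_unsigned_error(predicted: Sequence[float],
                        experimental: Sequence[float]) -> float:
    """Return the mean of |predicted - experimental|, in retention index units."""
    if len(predicted) != len(experimental):
        raise ValueError("predicted and experimental must have the same length")
    if len(predicted) == 0:
        raise ValueError("at least one value is required")
    total = sum(abs(p - e) for p, e in zip(predicted, experimental))
    return total / len(predicted)


# Hypothetical values for illustration only; not data from the NIST 20 library.
experimental_ri = [1000.0, 1200.0, 1450.0]
predicted_ri = [1010.0, 1170.0, 1470.0]

mue = mean_unsigned_error(predicted_ri, experimental_ri)
print(f"MUE = {mue:.1f} index units")  # prints "MUE = 20.0 index units"
```

Because the errors are averaged as absolute values, over- and under-predictions do not cancel, so a lower value means predictions that are closer to experiment on average.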