Khrisanfov Mikhail D, Matyushin Dmitriy D, Samokhin Andrey S
Chemistry Department, Lomonosov Moscow State University, Leninskie Gory 1-3, 119991, Moscow, Russia; A.N. Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, 31 Leninsky Prospect, GSP-1, 119071, Moscow, Russia.
A.N. Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, 31 Leninsky Prospect, GSP-1, 119071, Moscow, Russia.
Anal Chim Acta. 2024 Apr 8;1297:342375. doi: 10.1016/j.aca.2024.342375. Epub 2024 Feb 17.
The NIST retention index database is one of the most widely used sources of retention indices. Yet in both untargeted analysis and machine learning studies, filtering of its entries for potential errors is limited or absent. According to our estimates, about 80% of the compounds in both the NIST 17 and NIST 20 retention index databases have only one retention index (RI) value per stationary phase, which makes it impossible to search for erroneous values with statistical methods. Manual inspection is also impractical because the database contains more than 300 000 entries.
We suggest a two-step, machine-learning-based procedure for finding potentially erroneous retention indices. In the first step, five predictive models are used to obtain predicted retention index values for the whole database. In the second step, these predicted values are compared against the experimental ones. We consider a retention index erroneous if its accuracy (the difference between the predicted and experimental values) is in the bottom 5% for each of the five models simultaneously. Using this method, we detected 2093 outlier entries for standard and semi-standard non-polar stationary phases in the NIST 17 retention index database; 566 of those were corrected or removed by the developers in NIST 20.
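The flagging rule described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "accuracy" is measured as the absolute difference between predicted and experimental values, and that the predictions of each model are already available as plain lists. The function names `worst_fraction_mask` and `flag_suspect_entries` are hypothetical and chosen only for this illustration.

```python
from typing import List, Sequence


def worst_fraction_mask(errors: Sequence[float], fraction: float = 0.05) -> List[bool]:
    """Mark entries whose error is among the largest `fraction` of all errors.

    Ties at the cutoff are all marked, so slightly more than `fraction`
    of the entries can be selected.
    """
    n_worst = int(len(errors) * fraction)
    if n_worst == 0:
        return [False] * len(errors)
    # Smallest error that still belongs to the worst group.
    cutoff = sorted(errors, reverse=True)[n_worst - 1]
    return [e >= cutoff for e in errors]


def flag_suspect_entries(
    experimental: Sequence[float],
    predictions_by_model: Sequence[Sequence[float]],
    fraction: float = 0.05,
) -> List[int]:
    """Return indices of entries that fall in the worst `fraction` for every model.

    Assumption: the error is the absolute difference |predicted - experimental|.
    """
    masks = []
    for predicted in predictions_by_model:
        abs_errors = [abs(p - e) for p, e in zip(predicted, experimental)]
        masks.append(worst_fraction_mask(abs_errors, fraction))
    # An entry is flagged only if all models agree that it is poorly predicted.
    return [i for i in range(len(experimental)) if all(mask[i] for mask in masks)]
```

Requiring agreement between all models is the conservative part of the rule: an entry that only one model predicts poorly is more likely a weakness of that model than an error in the database.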
This is a novel approach to finding potentially erroneous entries in a large-scale database in which most entries are unique, and it can be applied to data other than retention indices. The procedure can help filter and report mishandled data, improving the quality of the dataset for machine learning applications and experimental use.