Boelrijk Jim, van Herwerden Denice, Ensing Bernd, Forré Patrick, Samanipour Saer
AI4Science Lab, University of Amsterdam, Amsterdam, The Netherlands.
Institute for Informatics, University of Amsterdam, Amsterdam, The Netherlands.
J Cheminform. 2023 Feb 24;15(1):28. doi: 10.1186/s13321-023-00699-8.
Non-target analysis combined with liquid chromatography high resolution mass spectrometry is considered one of the most comprehensive strategies for the detection and identification of known and unknown chemicals in complex samples. However, many compounds remain unidentified due to data complexity and the limited number of structures in chemical databases. In this work, we have developed and validated a novel machine learning algorithm to predict the retention index ($r_i$) values for structurally (un)known chemicals based on their measured fragmentation pattern. The developed model, for the first time, enabled the prediction of $r_i$ values without the need for the exact structure of the chemicals, with an $R^2$ of 0.91 and 0.77 and a root mean squared error (RMSE) of 47 and 67 $r_i$ units for the NORMAN ([Formula: see text]) and amide ([Formula: see text]) test sets, respectively. This fragment-based model showed accuracy in $r_i$ prediction comparable to that of conventional descriptor-based models that rely on known chemical structure, which obtained an $R^2$ of 0.85 with an RMSE of 67.
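The two figures of merit quoted above, the coefficient of determination and the RMSE, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `rmse` and `r_squared` and the four retention index values are hypothetical and chosen only to show how the metrics are computed. The sketch uses the common definition of the coefficient of determination, one minus the ratio of residual to total sum of squares; the abstract does not state which definition was used.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the inputs."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical retention index values, invented for illustration only;
# they are not taken from the NORMAN or amide test sets.
measured = [300.0, 450.0, 600.0, 750.0]
predicted = [320.0, 430.0, 610.0, 740.0]

print(f"RMSE = {rmse(measured, predicted):.2f} retention index units")
print(f"R^2  = {r_squared(measured, predicted):.4f}")
```

Because the RMSE is expressed in retention index units, it can be compared directly across models evaluated on the same scale, whereas the coefficient of determination also depends on the spread of the measured values in each test set.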