NGAlab, La Riera de Gaia, Tarragona 43762, Spain.
West Coast Metabolomics Center, UC Davis Genome Center, University of California, Davis, 451 Health Sciences Drive, Davis, California 95616, United States.
Anal Chem. 2020 Jun 2;92(11):7515-7522. doi: 10.1021/acs.analchem.9b05765. Epub 2020 May 21.
Unidentified peaks remain a major problem in untargeted metabolomics by LC-MS/MS. Confidence in peak annotations increases when MS/MS matching is combined with retention time. Here we show how retention times can be predicted from molecular structures. Two large, publicly available data sets were used to train machine learning models: the Fiehn hydrophilic interaction liquid chromatography (HILIC) data set of 981 primary metabolites and biogenic amines, and the RIKEN plant specialized metabolome annotation (PlaSMA) database of 852 secondary metabolites, which uses reversed-phase liquid chromatography (RPLC). Five machine learning algorithms have been integrated into the Retip R package for building retention time prediction models: random forest, Bayesian-regularized neural network, XGBoost, light gradient-boosting machine (LightGBM), and Keras. A complete workflow for retention time prediction was developed in R. It can be freely downloaded from the GitHub repository (https://www.retip.app). Keras outperformed the other machine learning algorithms on the test set with minimal overfitting, as verified by small differences in error between the training, test, and validation sets. Keras yielded a mean absolute error of 0.78 min for HILIC and 0.57 min for RPLC. Retip is integrated into the mass spectrometry software tools MS-DIAL and MS-FINDER, allowing a complete compound annotation workflow. In a test application on mouse blood plasma samples, we found a 68% reduction in the number of candidate structures when searching all isomers in the MS-FINDER compound identification software. Retention time prediction increases the identification rate in liquid chromatography and subsequently leads to an improved biological interpretation of metabolomics data.
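The two quantities the abstract reports, mean absolute error and the reduction in candidate structures, can be illustrated with a minimal sketch. This is not the Retip implementation (Retip is an R package). The `Candidate` class, the function names, the retention times, and the tolerance of twice the mean absolute error are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str            # hypothetical candidate structure label
    predicted_rt: float  # predicted retention time in minutes


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted RTs (min)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def filter_candidates(candidates, observed_rt, tolerance):
    """Keep candidates whose predicted RT lies within +/- tolerance minutes."""
    return [c for c in candidates if abs(c.predicted_rt - observed_rt) <= tolerance]


# Illustrative numbers only; none of these values come from the paper.
observed = [1.0, 2.5, 4.0, 6.0]
predicted = [1.5, 2.0, 4.5, 5.0]
mae = mean_absolute_error(observed, predicted)

# Hypothetical isomers sharing one formula, each with a predicted RT.
isomers = [
    Candidate("isomer_A", 3.1),
    Candidate("isomer_B", 5.9),
    Candidate("isomer_C", 8.4),
    Candidate("isomer_D", 6.8),
]

# Assumed tolerance: twice the MAE (an arbitrary choice for this sketch).
kept = filter_candidates(isomers, observed_rt=6.0, tolerance=2 * mae)
kept_names = [c.name for c in kept]
print(mae, kept_names)
```

In this toy case the retention time window discards two of the four isomers. A narrower tolerance removes more candidates but raises the risk of rejecting the correct structure when the prediction error for that compound is larger than average.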