Noreldeen Hamada A A
National Institute of Oceanography and Fisheries, NIOF, Cairo, Egypt.
J Chromatogr A. 2025 Feb 8;1742:465650. doi: 10.1016/j.chroma.2024.465650. Epub 2025 Jan 3.
The comprehensive identification of peaks in untargeted lipidomics using LC-MS/MS remains a significant challenge. Confidence in lipid annotation can be greatly improved by integrating a highly accurate machine learning-based retention time prediction model. Such an approach enables the identification of lipids for understanding pathogenic mechanisms, biomarker discovery, and drug screening. In this study, we developed a machine learning model to predict retention times and facilitate lipid peak annotations in LC-MS-based untargeted lipidomics. Our model achieved high correlation coefficients of 0.998 and 0.990, with mean absolute errors (MAE) of 0.107 min and 0.240 min for the training and test sets, respectively. External validation showed similarly strong performance, with correlations of 0.991 and 0.978, and MAE values of 0.241 min and 0.270 min. We also compared the impact of molecular descriptors and molecular fingerprints on the model's performance, finding that molecular descriptors outperformed molecular fingerprints across all datasets when using Random Forest (RF) for model construction. Notably, this retention time calibration model demonstrates robust performance across chromatographic systems with comparable gradients and flow rates. Overall, this machine learning model enhances lipid annotation accuracy and reduces errors in untargeted lipidomics, improving data analysis across multiple datasets.
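The two metrics reported above, the correlation coefficient and the mean absolute error (MAE) between predicted and observed retention times, can be sketched in a few lines, along with one way a predicted retention time might be used to narrow candidate annotations. This is a minimal illustration and not the authors' implementation. The retention times, lipid names, helper function names, and the 0.5 min tolerance are all hypothetical values chosen for the example, and the correlation is assumed to be Pearson's r.

```python
from math import sqrt


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted RTs (minutes)."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted RTs."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    ss_o = sum((o - mean_o) ** 2 for o in observed)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(ss_o * ss_p)


def filter_candidates(observed_rt, candidates, tolerance_min):
    """Keep candidate annotations whose predicted RT lies within the tolerance.

    `candidates` maps a candidate lipid name to its predicted RT in minutes.
    """
    return [
        name
        for name, predicted_rt in candidates.items()
        if abs(predicted_rt - observed_rt) <= tolerance_min
    ]


# Hypothetical retention times in minutes, for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.1, 3.9, 6.2, 7.8]

mae = mean_absolute_error(observed, predicted)
r = pearson_r(observed, predicted)

# Hypothetical candidates for a peak observed at 8.0 min; the 0.5 min
# tolerance is an assumption, not a value taken from the study.
kept = filter_candidates(8.0, {"PC 34:1": 7.9, "PE 36:2": 9.4}, 0.5)
```

In practice the tolerance would be chosen relative to the model's error on held-out data, so that a correct candidate is rarely rejected while implausible ones are still excluded.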