von Gerichten Johanna, Saunders Kyle, Bailey Melanie J, Gethings Lee A, Onoja Anthony, Geifman Nophar, Spick Matt
School of Chemistry and Chemical Engineering, Faculty of Engineering and Physical Sciences, University of Surrey, Guildford GU2 7XH, UK.
Waters Corporation, Wilmslow SK9 4AX, UK.
Metabolites. 2024 Aug 19;14(8):461. doi: 10.3390/metabo14080461.
Identification of features with high levels of confidence in liquid chromatography-mass spectrometry (LC-MS) lipidomics research is an essential part of biomarker discovery, but existing software platforms can give inconsistent results, even from identical spectral data. This poses a clear challenge for reproducibility in biomarker identification. In this work, we illustrate the reproducibility gap for two open-access lipidomics platforms, MS DIAL and Lipostar, finding just 14.0% identification agreement when analyzing identical LC-MS spectra using default settings. Whilst the software platforms performed more consistently using fragmentation data, agreement was still only 36.1% for MS2 spectra. This highlights the critical importance of validation across positive and negative LC-MS modes, as well as the manual curation of spectra and lipidomics software outputs, in order to reduce identification errors caused by closely related lipids and co-elution issues. This curation process can be supplemented by data-driven outlier detection in assessing spectral outputs, which is demonstrated here using a novel machine learning approach based on support vector machine regression combined with leave-one-out cross-validation. These steps are essential to reduce the frequency of false positive identifications and close the reproducibility gap, including between software platforms, which, for downstream users such as bioinformaticians and clinicians, can be an underappreciated source of biomarker identification errors.
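As a rough illustration of how an identification-agreement figure of this kind could be computed, the sketch below compares two sets of lipid annotations for the same spectral features. It assumes agreement is the share of features given the same lipid name by both platforms, out of all features annotated by either; the abstract does not state the exact matching criterion. The function name `identification_agreement`, the feature keys, and the toy annotations are hypothetical and are not data from the study.

```python
def identification_agreement(ids_a, ids_b):
    """Percentage of identifications on which two platforms agree.

    ids_a, ids_b: dicts mapping a feature key (here a hypothetical
    (m/z, retention time) tuple) to the lipid name a platform assigned.
    Assumption: agreement = features with identical annotations on both
    platforms / features annotated by at least one platform.
    """
    all_features = set(ids_a) | set(ids_b)
    if not all_features:
        return 0.0
    matching = sum(
        1
        for feature in all_features
        if feature in ids_a
        and feature in ids_b
        and ids_a[feature] == ids_b[feature]
    )
    return 100.0 * matching / len(all_features)


# Toy, made-up annotations for illustration only.
platform_a = {
    (760.585, 5.21): "PC 34:1",
    (788.616, 5.90): "PC 36:1",
    (703.575, 4.80): "SM 34:1;O2",
}
platform_b = {
    (760.585, 5.21): "PC 34:1",
    (788.616, 5.90): "PE 39:1",   # isomeric with PC 36:1, so a plausible conflict
    (732.554, 4.95): "PC 32:1",
}

agreement = identification_agreement(platform_a, platform_b)
print(f"{agreement:.1f}% agreement")
```

With these toy inputs, one of four distinct features carries the same annotation on both platforms, so the script prints `25.0% agreement`. Real exports would need features aligned within m/z and retention-time tolerances rather than by exact key equality, and that choice of tolerance is itself likely to affect the reported agreement.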