Mechanistic Understanding of the Discrepancies between Common Peak Picking Algorithms in Liquid Chromatography-Mass Spectrometry-Based Metabolomics.

Author information

Guo Jian, Huan Tao

Affiliation

Department of Chemistry, Faculty of Science, University of British Columbia, Vancouver Campus, 2036 Main Mall, Vancouver V6T 1Z1, BC, Canada.

Publication information

Anal Chem. 2023 Apr 11;95(14):5894-5902. doi: 10.1021/acs.analchem.2c04887. Epub 2023 Mar 27.

DOI:10.1021/acs.analchem.2c04887

PMID:36972195

Abstract

Inconsistent peak picking outcomes are a critical concern in processing liquid chromatography-mass spectrometry (LC-MS)-based untargeted metabolomics data. This work systematically studied the mechanisms behind the discrepancies among five commonly used peak picking algorithms, including CentWave in XCMS, linear-weighted moving average in MS-DIAL, automated data analysis pipeline (ADAP) in MZmine 2, Savitzky-Golay in El-MAVEN, and FeatureFinderMetabo in OpenMS. We first collected 10 public metabolomics datasets representing various LC-MS analytical conditions. We then incorporated several novel strategies to (i) acquire the optimal peak picking parameters of each algorithm for a fair comparison, (ii) automatically recognize false metabolic features with poor chromatographic peak shapes, and (iii) evaluate the real metabolic features that are missed by the algorithms. By applying these strategies, we compared the true, false, and undetected metabolic features in each data processing outcome. Our results show that linear-weighted moving average consistently outperforms the other peak picking algorithms. To facilitate a mechanistic understanding of the differences, we proposed six peak attributes: ideal slope, sharpness, peak height, mass deviation, peak width, and scan number. We also developed an R program to automatically measure these attributes for detected and undetected true metabolic features. From the results of the 10 datasets, we concluded that four peak attributes, including ideal slope, scan number, peak width, and mass deviation, are critical for the detectability of a peak. For instance, the focus on ideal slope critically hinders the extraction of true metabolic features with low ideal slope scores in linear-weighted moving average, Savitzky-Golay, and ADAP. The relationships between peak picking algorithms and peak attributes were also visualized in a principal component analysis biplot. 
Overall, the clear comparison and explanation of the differences between peak picking algorithms can lead to the design of better peak picking strategies in the future.
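
The abstract names six peak attributes but does not define them. As a rough illustration only, the Python sketch below measures four of them (peak height, peak width, scan number, and mass deviation) for a single extracted ion chromatogram. The function name `peak_attributes`, the use of width at half height, the count of nonzero scans, and the ppm deviation from the intensity-weighted mean m/z are all assumptions made for this example, not the definitions used in the authors' R program. Ideal slope and sharpness are left out because the abstract gives no formula for them.

```python
def peak_attributes(rt, intensity, mz):
    """Measure simple attributes of one chromatographic peak.

    rt        : retention time of each scan (ascending, e.g. seconds)
    intensity : signal intensity of each scan
    mz        : measured m/z of each scan

    All definitions below are illustrative assumptions, not the
    published ones.
    """
    rt, intensity, mz = list(rt), list(intensity), list(mz)
    if not rt or not (len(rt) == len(intensity) == len(mz)):
        raise ValueError("rt, intensity, and mz must be non-empty and equally long")

    height = max(intensity)
    if height <= 0:
        raise ValueError("the trace contains no signal")
    apex = intensity.index(height)

    # Walk outward from the apex while scans stay at or above half height.
    half = 0.5 * height
    left = apex
    while left > 0 and intensity[left - 1] >= half:
        left -= 1
    right = apex
    while right < len(rt) - 1 and intensity[right + 1] >= half:
        right += 1

    # Assumed mass deviation: largest absolute offset (in ppm) from the
    # intensity-weighted mean m/z, using only scans that carry signal.
    signal = [(m, i) for m, i in zip(mz, intensity) if i > 0]
    total = sum(i for _, i in signal)
    mz_ref = sum(m * i for m, i in signal) / total
    mass_dev_ppm = max(abs(m - mz_ref) for m, _ in signal) / mz_ref * 1e6

    return {
        "peak_height": float(height),
        "peak_width": float(rt[right] - rt[left]),  # width at half height
        "scan_number": len(signal),                 # scans with nonzero signal
        "mass_deviation_ppm": mass_dev_ppm,
    }


# Usage on a small synthetic peak (made-up numbers, not data from the study).
example = peak_attributes(
    rt=[0, 1, 2, 3, 4, 5, 6],
    intensity=[0, 10, 50, 100, 50, 10, 0],
    mz=[200.0] * 7,
)
print(example)
```

Because the width is read off at scan resolution, a peak sampled by only a few scans gets a coarse width estimate. This is one practical reason why scan number and peak width can interact when a peak picker decides whether to keep a feature.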
