Department of Chemistry, University of Nebraska-Lincoln, Lincoln, Nebraska 68588-0304, United States.
Nebraska Center for Integrated Biomolecular Communication, University of Nebraska-Lincoln, Lincoln, Nebraska 68588-0304, United States.
Anal Chem. 2024 Aug 13;96(32):12943-12956. doi: 10.1021/acs.analchem.3c04979. Epub 2024 Jul 30.
Metabolomics commonly relies on one-dimensional (1D) ¹H NMR spectroscopy or liquid chromatography-mass spectrometry (LC-MS) to derive scientific insights from large collections of biological samples. NMR and MS approaches to metabolomics require, among other things, a data processing pipeline. Quantitative assessment of the performance of these software platforms is hampered by a lack of standardized data sets with "known" outcomes. To resolve this issue, we created a novel simulated LC-MS data set with known peak locations and intensities, defined metabolite differences between groups (i.e., fold change > 2, coefficient of variation ≤ 25%), and different amounts of added Gaussian noise (0, 5, or 10%) and missing features (0, 10, or 20%). This data set was developed to improve benchmarking of existing LC-MS metabolomics software and to validate the updated version of our MVAPACK software, which added gas chromatography-MS and LC-MS functionality to its existing 1D and two-dimensional NMR data processing capabilities. We also included two experimental LC-MS data sets acquired from a standard mixture and cell lysates, since a simulated data set alone may not capture all the unique characteristics and variability of real spectra needed to assess software performance properly. Our simulated and experimental LC-MS data sets were processed with the MS-DIAL and XCMSOnline software packages and our MVAPACK toolkit to showcase the utility of our data sets for benchmarking MVAPACK against community standards. Our results demonstrate the enhanced objectivity and clarity of software assessment that can be achieved when both simulated and experimental data are employed, since distinctly different software performances were observed with the simulated and experimental LC-MS data sets. We also demonstrate that the performance of MVAPACK is equivalent to or exceeds that of existing LC-MS software programs while providing a single platform for processing and analyzing both NMR and MS data sets.
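The design of the simulated data set (known intensities, a defined fold change between groups, bounded within-group variation, added Gaussian noise, and missing features) can be illustrated with a minimal sketch. This is not the authors' simulation code and it works on a feature table rather than raw LC-MS spectra. The function name `simulate_feature_table`, all parameter names and defaults, the log-normal baseline intensities, the interpretation of the noise level as a fraction of the true intensity, and the assumption that features go missing completely at random are hypothetical choices made only for illustration.

```python
import numpy as np


def simulate_feature_table(n_features=200, n_per_group=10, frac_changed=0.1,
                           fold_change=2.5, cv=0.2, noise_frac=0.05,
                           missing_frac=0.10, seed=0):
    """Return (observed, labels, changed) for a two-group feature table.

    All names and defaults are illustrative assumptions, not values
    taken from the paper beyond the ranges quoted in the abstract.
    """
    rng = np.random.default_rng(seed)

    # Known ground-truth baseline intensity for each feature (assumed log-normal).
    base = rng.lognormal(mean=10.0, sigma=1.0, size=n_features)

    # Mark a known subset of features as truly different between groups.
    n_changed = int(round(frac_changed * n_features))
    changed = np.zeros(n_features, dtype=bool)
    changed[rng.choice(n_features, size=n_changed, replace=False)] = True

    # Group 0 keeps the baseline; group 1 is scaled by the fold change
    # for the changed features only.
    group_means = np.vstack([base, np.where(changed, base * fold_change, base)])
    labels = np.repeat([0, 1], n_per_group)
    means = group_means[labels]  # shape: (n_samples, n_features)

    # Within-group variation with the requested coefficient of variation.
    truth = means * (1.0 + cv * rng.standard_normal(means.shape))

    # Additive Gaussian noise, scaled as a fraction of the true mean intensity.
    observed = truth + noise_frac * means * rng.standard_normal(means.shape)
    observed = np.clip(observed, 0.0, None)  # intensities cannot be negative

    # Remove a fraction of entries completely at random to mimic missing features.
    observed[rng.random(observed.shape) < missing_frac] = np.nan

    return observed, labels, changed


if __name__ == "__main__":
    # Loop over the noise and missing-feature levels quoted in the abstract.
    for noise in (0.0, 0.05, 0.10):
        for missing in (0.0, 0.10, 0.20):
            table, labels, changed = simulate_feature_table(
                noise_frac=noise, missing_frac=missing)
            print(noise, missing, table.shape, int(changed.sum()))
```

Because the set of changed features is returned alongside the table, the output of any processing software can be scored against a known answer, which is the property the abstract identifies as missing from purely experimental benchmarks.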