Department of Epidemiology and Biostatistics, MRC-PHE Centre for Environment and Health, School of Public Health, Imperial College London, London W12 0BZ, U.K.
UK Dementia Research Institute, Imperial College London, London W12 0BZ, U.K.
Anal Chem. 2022 Apr 12;94(14):5493-5503. doi: 10.1021/acs.analchem.1c03592. Epub 2022 Mar 31.
Integration of multiple datasets can greatly enhance bioanalytical studies, for example, by increasing power to discover and validate biomarkers. In liquid chromatography-mass spectrometry (LC-MS) metabolomics, it is especially hard to combine untargeted datasets, since the majority of metabolomic features are not annotated and thus cannot be matched by chemical identity. Typically, the information available for each feature is retention time (RT), mass-to-charge ratio (m/z), and feature intensity (FI). Pairs of features from the same metabolite in separate datasets can exhibit small but significant differences, making matching very challenging. Current methods to address this issue are too simple or rely on assumptions that cannot be met in all cases. We present a method to find feature correspondence between two similar LC-MS metabolomics experiments or batches using only the features' RT, m/z, and FI. We demonstrate the method on both real and synthetic datasets, using six orthogonal validation strategies to gauge the matching quality. In our main example, 4953 features were uniquely matched, and 585 (96.8%) of 604 manually annotated features were correctly matched. In a second example, 2324 features could be uniquely matched, with 79 (90.8%) of 87 annotated features correctly matched. Most of the missed annotated matches are between features that behave very differently from the modeled inter-dataset shifts of RT, m/z, and FI. In a third example, using simulated data with 4755 features per dataset, 99.6% of the matches were correct. Finally, the results of matching three other dataset pairs using our method are compared with those of a published alternative method, metabCombiner, showing the advantages of our approach. The method can be applied using M2S (Match 2 Sets), a free, open-source MATLAB toolbox, available at https://github.com/rjdossan/M2S.
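To make the matching problem concrete, the following is a minimal Python sketch of the simplest kind of approach the abstract alludes to: pairing features whose RT, m/z, and FI differences all fall inside fixed tolerance windows, then keeping only the unique pairs. It is not the M2S algorithm, which additionally models inter-dataset shifts, and it does not use the toolbox's MATLAB API. The `Feature` class, the function names, the tolerance values, and the assumption that RT is in minutes are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    rt: float  # retention time (minutes assumed here)
    mz: float  # mass-to-charge ratio
    fi: float  # feature intensity, must be positive


def candidate_matches(ref, target, rt_tol=0.1, mz_tol=0.005, log_fi_tol=1.0):
    """Return (i, j) index pairs whose RT, m/z and log10 FI differences
    are all within the given (hypothetical) tolerances."""
    pairs = []
    for i, a in enumerate(ref):
        for j, b in enumerate(target):
            if (abs(a.rt - b.rt) <= rt_tol
                    and abs(a.mz - b.mz) <= mz_tol
                    and abs(math.log10(a.fi / b.fi)) <= log_fi_tol):
                pairs.append((i, j))
    return pairs


def unique_matches(pairs):
    """Keep only pairs in which both features occur in exactly one pair."""
    ref_counts = Counter(i for i, _ in pairs)
    target_counts = Counter(j for _, j in pairs)
    return [(i, j) for i, j in pairs
            if ref_counts[i] == 1 and target_counts[j] == 1]


# Toy data, invented for this example only.
ref = [
    Feature(1.00, 100.000, 5.0e4),
    Feature(2.50, 250.100, 1.0e5),
    Feature(4.00, 300.200, 2.0e4),
]
target = [
    Feature(1.03, 100.002, 4.5e4),
    Feature(2.52, 250.101, 9.0e4),
    Feature(2.55, 250.103, 5.0e4),
    Feature(6.00, 500.000, 1.0e4),
]

pairs = candidate_matches(ref, target)
print(pairs)                  # [(0, 0), (1, 1), (1, 2)]
print(unique_matches(pairs))  # [(0, 0)]
```

In this toy case the second reference feature has two candidates in the target set, so a fixed-window approach cannot decide between them and only one pair is uniquely matched. Ambiguities of this kind, together with systematic shifts between datasets, are the reason a method that models inter-dataset differences is needed.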