Department of Microbiology and Immunology, McGill University, Quebec, Canada.
Institute of Parasitology, McGill University, Quebec, Canada.
Brief Bioinform. 2023 Jan 19;24(1). doi: 10.1093/bib/bbac553.
Global or untargeted metabolomics is widely used to comprehensively investigate metabolic profiles under various pathophysiological conditions such as inflammation, infection, responses to exposures or interactions with microbial communities. However, biological interpretation of global metabolomics data remains a daunting task. Recent years have seen growing use of pathway enrichment analysis, based on putative annotations of liquid chromatography-mass spectrometry (LC-MS) peaks, for functional interpretation of LC-MS-based global metabolomics data. Because peak-metabolite and metabolite-pathway relationships are intricate, results obtained with different approaches vary considerably. There is an urgent need to benchmark these approaches to inform best practices.
We conducted a benchmark study of the peak annotation approaches and pathway enrichment methods commonly used in current metabolomics studies. Representative approaches, including three peak annotation methods and four enrichment methods, were selected and benchmarked under different scenarios. Based on the results, we provide a set of recommendations regarding peak annotation, ranking metrics and feature selection. Overall, the mummichog approach performed better than the other methods. We observed that a 30% annotation rate is sufficient to achieve high recall (about 90% with mummichog), and that using semi-annotated data improves functional interpretation. Based on the current platforms and enrichment methods, we further propose an identifiability index to indicate how likely a pathway is to be reliably identified. Finally, we evaluated all methods using 11 COVID-19 and 8 inflammatory bowel disease (IBD) global metabolomics datasets.
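To make the terms "pathway enrichment" and "recall" concrete, here is a minimal Python sketch. It uses a generic over-representation test (a one-sided hypergeometric test), chosen purely for illustration; it is not the mummichog algorithm and is not claimed to be one of the four methods benchmarked in the study. All names (`hypergeom_sf`, `enriched_pathways`, `recall`), the toy pathways and the 0.05 cutoff are assumptions introduced for this example.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n items without replacement from N items,
    K of which belong to the pathway (one-sided hypergeometric test)."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def enriched_pathways(significant, background, pathways, alpha=0.05):
    """Return the set of pathway names whose overlap with the significant
    metabolites has a hypergeometric p-value below alpha (hypothetical cutoff)."""
    background = set(background)
    significant = set(significant) & background
    result = set()
    for name, members in pathways.items():
        members = set(members) & background
        overlap = len(members & significant)
        p = hypergeom_sf(overlap, len(background), len(members), len(significant))
        if p < alpha:
            result.add(name)
    return result


def recall(found, truth):
    """Fraction of the truly perturbed pathways that were recovered."""
    truth = set(truth)
    return len(set(found) & truth) / len(truth)


# Toy data: 20 background metabolites and two 5-member pathways.
background = [f"m{i}" for i in range(20)]
pathways = {
    "P1": [f"m{i}" for i in range(0, 5)],
    "P2": [f"m{i}" for i in range(10, 15)],
}
significant = ["m0", "m1", "m2", "m3", "m10"]

found = enriched_pathways(significant, background, pathways)
print(found, recall(found, {"P1", "P2"}))
```

In this toy case only one of the two "true" pathways passes the cutoff, which shows how recall drops when too few members of a pathway are detected or annotated.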