School of Medicine, Technical University of Munich, Germany.
Ludwig Maximilian University of Munich, Germany.
Brief Bioinform. 2022 Mar 10;23(2). doi: 10.1093/bib/bbab535.
Large metabolomics datasets inevitably contain unwanted technical variation, which can obscure meaningful biological signals and affect how this information is applied to personalized healthcare. Many methods have been developed to handle unwanted variation, but the underlying assumptions of many of them hold only in a few specific scenarios. Some tools remove technical variation with models trained on quality control (QC) samples, which may not generalize well to subject samples. Additionally, almost none of the existing methods support datasets with multiple types of QC samples, which greatly limits their performance and flexibility. To address these issues, this study develops a non-parametric method, TIGER (Technical variation elImination with ensemble learninG architEctuRe), and releases it as an R package (https://CRAN.R-project.org/package=TIGERr). TIGER integrates the random forest algorithm into an adaptable ensemble learning architecture. Evaluation results show that TIGER outperforms four popular methods in robustness and reliability on three human cohort datasets constructed with targeted or untargeted metabolomics data. In addition, a case study that aims to identify age-associated metabolites illustrates how TIGER can be used for cross-kit adjustment in a longitudinal analysis, using experimental data from three time-points generated by different analytical kits. A dynamic website helps evaluate the performance of TIGER and examine the patterns revealed in the longitudinal analysis (https://han-siyu.github.io/TIGER_web/). Overall, TIGER is expected to be a powerful tool for metabolomics data analysis.
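The general idea of QC-based correction mentioned above can be illustrated with a minimal sketch. This is not TIGER's algorithm and does not use the TIGERr API. It assumes a single metabolite, a single QC type, and injection order as the only predictor, and it substitutes a small bagged ensemble of one-dimensional regression trees for a full random forest. All function names, field names, and the synthetic drift are hypothetical.

```python
import random
from statistics import mean, pstdev


def fit_tree(xs, ys, depth=3, min_leaf=2):
    """Fit a tiny 1-D regression tree and return a predict(x) function."""
    if depth == 0 or len(xs) < 2 * min_leaf or len(set(xs)) == 1:
        leaf = mean(ys)
        return lambda x: leaf
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs = [xs[i] for i in order]
    ys = [ys[i] for i in order]
    best = None  # (sum of squared errors, split index)
    for k in range(min_leaf, len(xs) - min_leaf + 1):
        if xs[k - 1] == xs[k]:
            continue  # cannot split between identical x values
        left, right = ys[:k], ys[k:]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, k)
    if best is None:
        leaf = mean(ys)
        return lambda x: leaf
    k = best[1]
    threshold = (xs[k - 1] + xs[k]) / 2
    lower = fit_tree(xs[:k], ys[:k], depth - 1, min_leaf)
    upper = fit_tree(xs[k:], ys[k:], depth - 1, min_leaf)
    return lambda x: lower(x) if x <= threshold else upper(x)


def fit_bagged_trees(xs, ys, n_trees=50, seed=0):
    """Average trees fitted on bootstrap resamples (bagging)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_tree([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(t(x) for t in trees)


def correct_run(run, n_trees=50, seed=0):
    """Divide every intensity by the drift factor learned from QC samples.

    `run` is a list of dicts with keys 'order', 'intensity', 'is_qc'
    (hypothetical field names chosen for this sketch).
    """
    qc = [s for s in run if s["is_qc"]]
    target = mean(s["intensity"] for s in qc)  # reference QC level
    drift = fit_bagged_trees(
        [s["order"] for s in qc],
        [s["intensity"] / target for s in qc],  # relative deviation of each QC
        n_trees=n_trees,
        seed=seed,
    )
    return [s["intensity"] / drift(s["order"]) for s in run]


def make_demo_run(n=40, qc_every=4):
    """Synthetic run: signal drifts upward by 1% per injection (assumed)."""
    run = []
    for order in range(n):
        is_qc = order % qc_every == 0
        true_value = 100.0 if is_qc else 80.0
        run.append({"order": order, "is_qc": is_qc,
                    "intensity": true_value * (1 + 0.01 * order)})
    return run


run = make_demo_run()
corrected = correct_run(run)
raw_qc = [s["intensity"] for s in run if s["is_qc"]]
corrected_qc = [v for s, v in zip(run, corrected) if s["is_qc"]]
print("QC spread before:", pstdev(raw_qc), "after:", pstdev(corrected_qc))
```

Because the model sees only QC samples, it can correct only the drift that the QC samples share with the subject samples, which is the generalization concern the abstract raises about QC-trained models.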