Machine Intelligence Department, Simula Metropolitan Center for Digital Engineering, Oslo, Norway.
Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, The Netherlands.
BMC Bioinformatics. 2022 Jan 10;23(1):31. doi: 10.1186/s12859-021-04550-5.
Analysis of dynamic metabolomics data holds promise for improving our understanding of the mechanisms underlying metabolism. For example, it may detect changes in metabolism due to the onset of a disease. Dynamic, or time-resolved, metabolomics data can be arranged as a three-way array with entries organized according to a subjects mode, a metabolites mode, and a time mode. While such time-evolving multiway data sets are increasingly collected, revealing the underlying mechanisms and their dynamics from such data remains challenging. One of the complexities is that several sources of variation are superimposed: induced variation (due to experimental conditions or inborn errors), individual variation, and measurement error. Multiway data analysis (also known as tensor factorization) has been used successfully in data mining to find underlying patterns in multiway data. To explore how well multiway data analysis methods reveal the underlying mechanisms in dynamic metabolomics data, simulated data with known ground truth can be studied.
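To make the three-way arrangement concrete, the standard CP (CANDECOMP/PARAFAC) model for such an array can be written as below. The symbols are assumptions introduced here for illustration and do not appear in the abstract: x_{ijk} is the measurement for subject i, metabolite j, and time point k; R is the number of components; e_{ijk} collects the unexplained part, including measurement error.

```latex
x_{ijk} = \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr} + e_{ijk},
\qquad i = 1,\dots,I,\quad j = 1,\dots,J,\quad k = 1,\dots,K
```

In this reading, each component r pairs a subject score vector a_r with a metabolite profile b_r and a time profile c_r, which is what would allow a source of variation to be inspected in all three modes at once.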
We focus on simulated data arising from dynamic models of increasing complexity: a simple linear system, a yeast glycolysis model, and a human cholesterol model. We generate data with induced variation as well as individual variation. Systematic experiments are performed to demonstrate the advantages and limitations of multiway data analysis in analyzing such dynamic metabolomics data and its capacity to disentangle the different sources of variation. We use simulations because knowing the ground truth makes it possible to assess the capability of multiway data analysis methods.
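As a rough illustration of how data with induced and individual variation might be generated from a simple linear system, the sketch below simulates a three-metabolite chain x1 -> x2 -> x3 -> (out). Everything specific here is an assumption made for the example and is not taken from the paper: the chain topology, the rate constants, the forward-Euler integrator, the choice to halve k2 in an "affected" group (induced variation), and the Gaussian perturbation of rate constants per subject (individual variation).

```python
import random


def simulate_subject(rates, x0, n_steps=50, dt=0.1, substeps=10):
    """Integrate the linear chain x1 -> x2 -> x3 -> (out) with forward Euler.

    Returns three time profiles (one per metabolite), each of length n_steps.
    """
    k1, k2, k3 = rates
    x1, x2, x3 = x0
    h = dt / substeps  # finer internal step for a more accurate Euler solution
    profiles = [[], [], []]
    for _ in range(n_steps):
        # Record the state at the current sampling time.
        profiles[0].append(x1)
        profiles[1].append(x2)
        profiles[2].append(x3)
        # Advance the state by one sampling interval dt.
        for _ in range(substeps):
            d1 = -k1 * x1
            d2 = k1 * x1 - k2 * x2
            d3 = k2 * x2 - k3 * x3
            x1, x2, x3 = x1 + h * d1, x2 + h * d2, x3 + h * d3
    return profiles


def simulate_cohort(n_subjects=10, n_steps=50, individual_sd=0.05,
                    noise_sd=0.01, seed=0):
    """Return (data, groups); data is indexed [subject][metabolite][time]."""
    rng = random.Random(seed)
    base_rates = (1.0, 0.5, 0.2)  # hypothetical rate constants
    data, groups = [], []
    for s in range(n_subjects):
        affected = s >= n_subjects // 2
        # Induced variation (assumed form): k2 is halved in the affected group.
        induced = (1.0, 0.5 if affected else 1.0, 1.0)
        # Individual variation (assumed form): each rate gets its own
        # multiplicative Gaussian perturbation, floored to stay positive.
        rates = tuple(
            k * f * max(0.1, 1.0 + rng.gauss(0.0, individual_sd))
            for k, f in zip(base_rates, induced)
        )
        clean = simulate_subject(rates, (1.0, 0.0, 0.0), n_steps)
        # Measurement error: additive Gaussian noise on every entry.
        noisy = [[v + rng.gauss(0.0, noise_sd) for v in prof] for prof in clean]
        data.append(noisy)
        groups.append(affected)
    return data, groups


data, groups = simulate_cohort()
print(len(data), len(data[0]), len(data[0][0]))  # subjects, metabolites, time points
```

The nested list returned by the hypothetical `simulate_cohort` has the subjects x metabolites x time layout described above, so it could be passed to a tensor factorization routine after conversion to an array.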
Our numerical experiments demonstrate that, despite the increasing complexity of the studied dynamic metabolic models, the tensor factorization methods CANDECOMP/PARAFAC (CP) and Parallel Profiles with Linear Dependences (Paralind) can disentangle the sources of variation and thereby reveal the underlying mechanisms and their dynamics.
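For readers unfamiliar with CP, the following is a minimal sketch of fitting a CP model to a three-way array with alternating least squares (ALS), written with NumPy. It is not the authors' implementation, it covers only CP and not Paralind, and ALS is assumed here simply because it is a common way to fit CP. The function names `khatri_rao`, `cp_als`, and `relative_error` are hypothetical helpers defined for this example.

```python
import numpy as np


def khatri_rao(U, V):
    """Column-wise Kronecker product; rows are ordered with V's index fastest."""
    R = U.shape[1]
    return (U[:, None, :] * V[None, :, :]).reshape(-1, R)


def cp_als(X, rank, n_iter=200, seed=0):
    """Fit X[i, j, k] ~ sum_r A[i, r] * B[j, r] * C[k, r] by plain ALS."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    # Unfold the tensor along each mode; the column order matches khatri_rao.
    X1 = X.reshape(I, J * K)
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)
    for _ in range(n_iter):
        # Each update is a linear least-squares solve with the other two
        # factor matrices held fixed.
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C


def relative_error(X, A, B, C):
    """Frobenius-norm error of the CP reconstruction relative to X."""
    X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)


# Toy usage on a noiseless rank-2 tensor (subjects x metabolites x time).
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((8, 2)), rng.random((5, 2)), rng.random((12, 2))
X = np.einsum("ir,jr,kr->ijk", A0, B0, C0)
A, B, C = cp_als(X, rank=2)
print("relative error:", relative_error(X, A, B, C))
```

In such a fit, the columns of `A` would be compared across subject groups, while the matching columns of `B` and `C` indicate which metabolites and which time course are associated with that component. The sketch omits factor normalization, convergence checks, and multiple random restarts, all of which a careful analysis would typically add.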