Institute of Statistics, Biostatistics and Actuarial Sciences (ISBA), Université Catholique de Louvain (UCL), Voie du Roman Pays 20, bte L1.04.01, 1348, Louvain-la-Neuve, Belgium.
Machine Learning Group, Université Catholique de Louvain (UCL), Louvain-la-Neuve, Belgium.
Metabolomics. 2019 Apr 16;15(4):63. doi: 10.1007/s11306-019-1524-3.
The pre-processing of analytical data in metabolomics must be considered as a whole so that a single, global data object can be built for any subsequent joint data analysis or multivariate statistical modelling. For 1D H-NMR metabolomics experiments, best practices for data pre-processing are well defined; for 2D experiments (COSY in this paper), they are not yet established.
Taking into account the added value of a second dimension, the objective is to propose two workflows dedicated to 2D NMR data handling and preparation (the Global Peak List, GPL, and Vectorization approaches) and to compare them with each other and with 1D standards. The comparison is intended to identify which methodology yields the most metabolomic content and to explore the advantages of the selected workflow in distinguishing among treatment groups and identifying relevant biomarkers. The paper therefore examines the need for novel 2D pre-processing workflows, the evaluation of their quality, and their performance in the subsequent determination of accurate (2D) biomarkers.
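As a rough illustration of what a vectorization step could look like, the sketch below flattens each 2D spectrum into one row of a samples-by-variables matrix. It is only a minimal sketch: the function name `vectorize_spectra`, the toy data, and the assumption that all spectra share the same grid are hypothetical, and the workflow described in the paper may involve further steps that are not stated here.

```python
import numpy as np

def vectorize_spectra(spectra):
    """Stack equally sized 2D spectra into a samples x variables matrix.

    Assumption: every spectrum is a 2D array sampled on the same grid.
    """
    stacked = np.stack([np.asarray(s, dtype=float) for s in spectra])
    n_samples = stacked.shape[0]
    # Each 2D spectrum is unrolled row by row into one long vector.
    return stacked.reshape(n_samples, -1)

# Toy data (hypothetical): three random 4 x 5 "spectra".
rng = np.random.default_rng(0)
toy_spectra = [rng.random((4, 5)) for _ in range(3)]
X = vectorize_spectra(toy_spectra)
print(X.shape)
```

The resulting matrix has one row per sample, which is the layout expected by most multivariate methods.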
To select the most informative data source, MIC (Metabolomic Informative Content) indexes, based on clustering and inertia measures of quality, are used. Then, to highlight biomarkers or critical spectral zones, a PLS-DA model is used, along with more advanced sparse algorithms (sPLS and L-sOPLS).
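To give a feel for an inertia-based quality measure, the sketch below computes the share of total inertia explained by known groups (between-group inertia divided by total inertia). This is a generic separability ratio chosen for illustration, not the MIC definition from the paper, which is not given in this abstract; the function name and toy data are hypothetical.

```python
import numpy as np

def between_group_inertia_ratio(X, labels):
    """Return between-group inertia / total inertia (0 = no separation, 1 = perfect)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    total = ((X - grand_mean) ** 2).sum()
    between = 0.0
    for g in np.unique(labels):
        members = X[labels == g]
        # Weight each group centroid by the number of samples in the group.
        between += len(members) * ((members.mean(axis=0) - grand_mean) ** 2).sum()
    return between / total if total > 0 else 0.0

# Toy data (hypothetical): the same four points with two different labelings.
points_separated = [[0, 0], [0, 0], [10, 10], [10, 10]]
points_mixed = [[0, 0], [10, 10], [0, 0], [10, 10]]
groups = [0, 0, 1, 1]
ratio_separated = between_group_inertia_ratio(points_separated, groups)
ratio_mixed = between_group_inertia_ratio(points_mixed, groups)
```

A ratio close to 1 means the groups account for nearly all of the variability, so a data source with a higher ratio would be the more informative one under this simplified criterion.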
Results are discussed for two different experimental designs: one unsupervised, based on human urine samples, and one controlled, based on spiked serum media. MIC indexes are reported and guide the choice of the most relevant workflow for the subsequent analysis. Finally, biomarkers are provided for each case, and the predictive power of each candidate model is assessed with cross-validated RMSEP.
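The cross-validated RMSEP (root mean squared error of prediction) can be sketched as follows. To keep the example dependency-free, ordinary least squares is used as a stand-in for the PLS-type models named in the abstract; the function name `cv_rmsep`, the fold count, and the simulated data are assumptions made for illustration only.

```python
import numpy as np

def cv_rmsep(X, y, n_folds=5, seed=0):
    """K-fold cross-validated RMSEP with an ordinary least-squares stand-in model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    order = np.random.default_rng(seed).permutation(n)
    residuals = np.empty(n)
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        # Fit on the training fold only (intercept column + predictors).
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Predict the held-out samples and store their residuals.
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        residuals[test_idx] = y[test_idx] - A_test @ coef
    return float(np.sqrt(np.mean(residuals ** 2)))

# Simulated data (hypothetical): a linear response with and without noise.
rng = np.random.default_rng(1)
X_sim = rng.normal(size=(40, 3))
y_clean = X_sim @ np.array([1.0, -2.0, 0.5]) + 3.0
y_noisy = y_clean + rng.normal(scale=0.5, size=40)
rmsep_clean = cv_rmsep(X_sim, y_clean)
rmsep_noisy = cv_rmsep(X_sim, y_noisy)
```

Because every prediction is made on samples left out of the fit, a lower value indicates better out-of-sample predictive power; in a PLS-DA setting the response `y` would typically be a numerically coded class membership.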
In conclusion, no single solution is the best in every case, but 2D experiments make it possible to clearly identify relevant cross-peak biomarkers even when the initial separability between groups is poor. The MIC measures associated with the candidate workflows (2D GPL, 2D vectorization and 1D, with specific parameters) show which data set should be used first to find biomarkers more easily. The diversity of data sources, mainly 1D versus 2D, may often lead to complementary or confirmatory results.