AgroParisTech, UMR 1145 Ingénierie Procédés Aliments, 16, rue Claude Bernard, F-75005 Paris, France; INRA, UMR 1145 Ingénierie Procédés Aliments, F-75005 Paris, France.
Anal Chim Acta. 2014 Feb 27;813:25-34. doi: 10.1016/j.aca.2014.01.025. Epub 2014 Jan 16.
The integration of multiple data sources has become pivotal for the comprehensive assessment of complex systems. This new paradigm requires the ability to separate common, redundant information from specific, complementary information during the joint analysis of several data blocks. However, the problems inherent in analysing single tables are amplified when multiblock datasets are generated. Finding the relationships between data layers of increasing complexity is therefore a challenging task. In the present work, an algorithm is proposed for the supervised analysis of multiblock data structures. It combines the interpretability of the orthogonal partial least squares (OPLS) framework with the ability of common component and specific weights analysis (CCSWA, also known as ComDim) to weight each data table individually, so as to capture its specificities and handle efficiently the different sources of Y-orthogonal variation. Three applications are presented for illustration. The first is a quantitative structure-activity relationship study aiming to predict the binding affinity of flavonoids toward the P-glycoprotein based on physicochemical properties. The second concerns the integration of several groups of sensory attributes for the overall quality assessment of a series of red wines. The third highlights the ability of the method to combine very large heterogeneous data blocks from Omics experiments in systems biology. Results were compared with those of the reference multiblock partial least squares (MBPLS) method to assess the performance of the proposed algorithm in terms of predictive ability and model interpretability. In all cases, the proposed algorithm, ComDim-OPLS, proved to be a relevant data mining strategy for the simultaneous analysis of multiblock structures, accounting for specific sources of variation in each dataset and providing a balance between predictive and descriptive purposes.
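To make the idea of weighting each data table individually more concrete, the following is a minimal sketch of the block-weighting step usually associated with CCSWA (commonly known as ComDim), restricted to the first common component. It is not the authors' ComDim-OPLS implementation: the supervised part and the OPLS filtering of Y-orthogonal variation are omitted, and the function name `comdim_first_component`, its parameters, and the choice of unit Frobenius-norm block scaling are assumptions made for illustration.

```python
import numpy as np

def comdim_first_component(blocks, n_iter=100, tol=1e-10):
    """Sketch of one CCSWA/ComDim-style common component (hypothetical helper).

    blocks: list of 2-D arrays sharing the same rows (samples).
    Returns the common score vector q (unit norm) and one salience per block.
    """
    # Column-centre each block and scale it to unit Frobenius norm so that
    # no block dominates merely because of its size or units (assumed scaling).
    scaled = []
    for X in blocks:
        Xc = X - X.mean(axis=0)
        scaled.append(Xc / np.linalg.norm(Xc))

    # Sample-by-sample cross-product matrix of each block.
    W = [X @ X.T for X in scaled]

    saliences = np.ones(len(W))
    for _ in range(n_iter):
        # Weighted sum of the block matrices, using the current saliences.
        W_sum = sum(s * Wk for s, Wk in zip(saliences, W))
        # Dominant eigenvector of the symmetric weighted sum
        # (eigh returns eigenvalues in ascending order).
        _, eigvecs = np.linalg.eigh(W_sum)
        q = eigvecs[:, -1]
        # Salience of block k: how much of that block lies along q.
        new_saliences = np.array([q @ Wk @ q for Wk in W])
        converged = np.max(np.abs(new_saliences - saliences)) < tol
        saliences = new_saliences
        if converged:
            break
    return q, saliences
```

In this sketch, a block that carries much of the shared structure receives a salience close to 1, while a block dominated by unrelated variation receives a salience close to 0, which is the sense in which each table is weighted according to its own specificities.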