A strategy for multimodal data integration: application to biomarkers identification in spinocerebellar ataxia.

Affiliations

Bioinformatics and Biostatistics Core Facility of the Brain and Spine Institute, La Pitié-Salpêtrière Hospital, Paris, France.

Pierre and Marie Curie University.

Publication information

Brief Bioinform. 2018 Nov 27;19(6):1356-1369. doi: 10.1093/bib/bbx060.

DOI:10.1093/bib/bbx060

PMID:29106465

Abstract

The growing number of modalities (e.g. multi-omics, imaging and clinical data) characterizing a given disease provides physicians and statisticians with complementary facets reflecting the disease process but emphasizes the need for novel statistical methods of data analysis able to unify these views. Such data sets are indeed intrinsically structured in blocks, where each block represents a set of variables observed on a group of individuals. Therefore, classical statistical tools cannot be applied without altering their organization, with the risk of information loss. Regularized generalized canonical correlation analysis (RGCCA) and its sparse generalized canonical correlation analysis (SGCCA) counterpart are component-based methods for exploratory analyses of data sets structured in blocks of variables. Rather than operating sequentially on parts of the measurements, the RGCCA/SGCCA-based integrative analysis method aims at summarizing the relevant information between and within the blocks. It processes a priori information defining which blocks are supposed to be linked to one another, thus reflecting hypotheses about the biology underlying the data blocks. It also requires the setting of extra parameters that need to be carefully adjusted. Here, we provide practical guidelines for the use of RGCCA/SGCCA. We also illustrate the flexibility and usefulness of RGCCA/SGCCA on a unique cohort of patients with four genetic subtypes of spinocerebellar ataxia, in which we obtained multiple data sets from brain volumetry and magnetic resonance spectroscopy, and metabolomic and lipidomic analyses. As a first step toward the extraction of multimodal biomarkers, and through the reduction to a few meaningful components and the visualization of relevant variables, we identified possible markers of disease progression.
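
To make the block structure and the a priori links concrete, the following is a minimal NumPy sketch of an RGCCA-style iteration. It is not the authors' implementation. It assumes the simplest setting known from the RGCCA literature: every shrinkage parameter fixed at 1 (so each block weight vector is only constrained to unit norm) and the identity ("Horst") scheme, in which the quantity being increased is the sum of covariances between components of linked blocks. The function name `rgcca_sketch`, the `design` matrix argument and the simulated three-block data are all hypothetical and serve only as illustration. The sparse variant (SGCCA) would add an L1 constraint on the weights, which is not shown.

```python
import numpy as np

def rgcca_sketch(blocks, design, n_iter=200, tol=1e-10, seed=0):
    """Illustrative RGCCA-style iteration (assumed: shrinkage = 1, Horst scheme).

    blocks : list of (n, p_j) arrays observed on the same n individuals
    design : (J, J) symmetric 0/1 array; design[j, k] = 1 if blocks j and k
             are hypothesised to be linked (each block needs at least one link)
    Returns one unit-norm weight vector and one component per block.
    """
    rng = np.random.default_rng(seed)
    # Standardise every variable (assumes no constant column).
    X = [(b - b.mean(axis=0)) / b.std(axis=0) for b in blocks]
    J = len(X)
    # Random unit-norm starting weights.
    a = []
    for Xj in X:
        v = rng.standard_normal(Xj.shape[1])
        a.append(v / np.linalg.norm(v))
    y = [Xj @ aj for Xj, aj in zip(X, a)]
    prev = -np.inf
    for _ in range(n_iter):
        for j in range(J):
            # Inner component: sum of the components of the linked blocks.
            z = sum(design[j, k] * y[k] for k in range(J))
            v = X[j].T @ z
            a[j] = v / np.linalg.norm(v)   # unit-norm weight update
            y[j] = X[j] @ a[j]             # updated block component
        # Criterion: sum of covariances between linked components.
        crit = sum(design[j, k] * np.cov(y[j], y[k])[0, 1]
                   for j in range(J) for k in range(J))
        if abs(crit - prev) < tol:
            break
        prev = crit
    return a, y

# Simulated example: three blocks driven by one shared latent variable.
rng = np.random.default_rng(42)
n = 200
latent = rng.standard_normal(n)
blocks = [latent[:, None] + 0.5 * rng.standard_normal((n, p)) for p in (5, 4, 3)]
design = np.ones((3, 3)) - np.eye(3)   # hypothesis: every block linked to every other
weights, components = rgcca_sketch(blocks, design)
```

In this sketch the `design` matrix plays the role of the a priori information described above: setting an entry to 0 removes the corresponding pair of blocks from the criterion, so different biological hypotheses translate into different matrices.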
