LIST, CEA, Laboratory for Data Sciences and Decision (Digiteo), Gif-sur-Yvette, France.
CNRS, Laboratoire de Mathématiques et de leurs Applications de PAU E2S UPPA, Pau, France.
BMC Bioinformatics. 2021 Feb 24;22(1):86. doi: 10.1186/s12859-021-03968-1.
The increasing number of genome-wide association studies (GWAS) has revealed several loci that are associated with multiple distinct phenotypes, suggesting the existence of pleiotropic effects. Highlighting these cross-phenotype genetic associations could help to identify and understand common biological mechanisms underlying some diseases. Common approaches test the association between genetic variants and multiple traits at the SNP level. In this paper, we propose a novel gene- and pathway-level approach for the case where several independent GWAS on independent traits are available. The method is based on a generalization of the sparse group Partial Least Squares (sgPLS), which takes groups of variables into account, combined with a Lasso penalization that links all the independent data sets. This method, called joint-sgPLS, is able to convincingly detect signal at both the variable level and the group level.
Our method has the advantage of providing a single, readable global model while accounting for the structure of the data. It can outperform traditional methods and offers broader insight in terms of a priori information. We compared the performance of the proposed method with that of other benchmark methods on simulated data, and we give an example application to real data with the aim of highlighting susceptibility variants common to breast and thyroid cancers.
The joint-sgPLS shows interesting properties for signal detection. As an extension of PLS, the method is suited to data with a large number of variables. The Lasso penalization accommodates the structure of both the groups of variables and the sets of observations. Furthermore, although the method has been applied here to a genetic study, its formulation is suited to any data with a large number of variables and a known a priori structure, including data from other application fields.
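To make the two levels of selection (variable and group) more concrete, here is a minimal sketch in Python of a sparse-group thresholding step applied to a first PLS weight vector. It is only an illustration of the general sparse group Lasso idea, not the authors' joint-sgPLS algorithm: the function names, the parameters `lam_var` and `lam_group`, and the scaling of the group threshold by the square root of the group size are all assumptions made for this example, and the joint penalty that links several independent studies is not shown.

```python
import numpy as np


def soft_threshold(z, lam):
    """Element-wise Lasso soft-thresholding (variable-level sparsity)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def sparse_group_threshold(z, groups, lam_var, lam_group):
    """Soft-threshold each coefficient, then shrink or drop whole groups.

    z       : 1-D array of raw weights (e.g. X^T y for a first PLS component)
    groups  : 1-D integer array giving the group (e.g. gene) of each variable
    lam_var, lam_group : hypothetical tuning parameters for this sketch
    """
    z = np.asarray(z, dtype=float)
    groups = np.asarray(groups)
    w = soft_threshold(z, lam_var)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(w[idx])
        # Assumed convention: group threshold scaled by sqrt(group size).
        thresh = lam_group * np.sqrt(idx.sum())
        if norm <= thresh:
            w[idx] = 0.0                    # the whole group is discarded
        else:
            w[idx] *= 1.0 - thresh / norm   # the group is kept but shrunk
    return w


def sparse_group_pls_weights(X, y, groups, lam_var, lam_group):
    """First sparse-group weight vector from the direction X^T (y - mean(y))."""
    z = X.T @ (y - y.mean())
    z_norm = np.linalg.norm(z)
    if z_norm > 0:
        z = z / z_norm
    w = sparse_group_threshold(z, groups, lam_var, lam_group)
    w_norm = np.linalg.norm(w)
    return w / w_norm if w_norm > 0 else w
```

In this sketch, variables whose weight survives the element-wise threshold but whose group norm stays below the group threshold are removed together, which mirrors the gene- or pathway-level selection described in the abstract.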