Liquet Benoît, de Micheaux Pierre Lafaye, Hejblum Boris P, Thiébaut Rodolphe
School of Mathematics and Physics, The University of Queensland, Brisbane 4066, Australia, ARC Centre of Excellence for Mathematical and Statistical Frontiers, QUT, Brisbane, Australia.
CREST, ENSAI, Campus de Ker-Lann, Rue Blaise Pascal, BP 37203, 35172 Bruz cedex, France.
Bioinformatics. 2016 Jan 1;32(1):35-42. doi: 10.1093/bioinformatics/btv535. Epub 2015 Sep 10.
Studying the association between two blocks of 'omics' data raises challenging issues in computational biology because of the size and complexity of the data. Here, we focus on a class of multivariate statistical methods called partial least squares (PLS). The sparse version of PLS (sPLS) integrates two datasets while simultaneously selecting the contributing variables. However, these methods do not take into account the important structural or group effects that arise from the relationships between markers within biological pathways. Hence, taking predefined groups of markers (e.g. gene sets) into account could improve the relevance and efficacy of the PLS approach.
We propose two PLS extensions called group PLS (gPLS) and sparse gPLS (sgPLS). Our algorithm enables the study of the relationship between two different types of omics data (e.g. SNP and gene expression) or between an omics dataset and multivariate phenotypes (e.g. cytokine secretion). We demonstrate the good performance of gPLS and sgPLS compared with sPLS in the context of grouped data. These methods are then compared on data from an HIV therapeutic vaccine trial. Our approaches provide parsimonious models that reveal the relationship between gene abundance and the immunological response to the vaccine.
The approach is implemented in a comprehensive R package called sgPLS, available on CRAN.
Supplementary data are available at Bioinformatics online.
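To make the idea of selecting predefined groups of markers more concrete, the following is a minimal Python sketch of a group-wise soft-thresholding step, a standard building block of group-lasso-type penalties. It is an illustration under stated assumptions, not code from the sgPLS package: the function name `group_soft_threshold`, the penalty level `lam`, and the scaling of the threshold by the square root of the group size are all hypothetical choices made for this example.

```python
import math


def group_soft_threshold(weights, groups, lam):
    """Shrink each group of weights toward zero as a block.

    weights: list of floats (for example, a loading vector)
    groups:  list of index lists, one per predefined group of markers
    lam:     non-negative penalty level (hypothetical tuning parameter)
    """
    out = [0.0] * len(weights)
    for idx in groups:
        # Euclidean norm of the weights belonging to this group.
        norm = math.sqrt(sum(weights[i] ** 2 for i in idx))
        # Assumed convention: threshold grows with sqrt(group size).
        thresh = lam * math.sqrt(len(idx))
        if norm > thresh:
            # Keep the group, shrinking all its members by a common factor.
            scale = 1.0 - thresh / norm
            for i in idx:
                out[i] = scale * weights[i]
        # Otherwise the whole group is left at zero, i.e. it is dropped.
    return out


# Toy example: one strong group and one weak group.
weights = [3.0, 4.0, 0.1, -0.2]
groups = [[0, 1], [2, 3]]
shrunk = group_soft_threshold(weights, groups, lam=1.0)
```

In this toy example the second group has a small norm and is removed as a whole, while the first group is kept and shrunk proportionally. Selecting or discarding markers group by group in this way is the kind of behaviour that distinguishes a group-level penalty from the variable-by-variable selection of sPLS.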