Gu Zhujie, El Bouhaddani Said, Pei Jiayi, Houwing-Duistermaat Jeanine, Uh Hae-Won
Department of Data Science and Biostatistics, UMC Utrecht, div. Julius Centre, Huispost Str. 6.131, 3508 GA, Utrecht, The Netherlands.
Department of Cardiology, UMC Utrecht, Huispost Str. 6.131, 3508 GA, Utrecht, The Netherlands.
BMC Bioinformatics. 2021 Mar 18;22(1):131. doi: 10.1186/s12859-021-03958-3.
Multiple omics data are now routinely measured on the same samples, in the belief that the different omics datasets represent various aspects of the underlying biological system. Integrating these datasets facilitates understanding of the system. Various methods have been proposed for this purpose, such as Partial Least Squares (PLS), which decomposes two datasets into joint and residual subspaces. Because omics data are heterogeneous, the joint components in PLS will contain variation specific to each dataset. To account for this, Two-way Orthogonal Partial Least Squares (O2PLS) captures the heterogeneity by introducing orthogonal subspaces and thereby better estimates the joint subspaces. However, the latent components spanning the joint subspaces in O2PLS are linear combinations of all variables, whereas it may be of interest to identify a small subset of variables relevant to the research question. To obtain sparsity, we extend O2PLS to Group Sparse O2PLS (GO2PLS), which utilizes biological information on group structures among variables and performs group selection in the joint subspace.
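To make the idea of group selection in a joint subspace concrete, the following is a minimal sketch, not the GO2PLS algorithm itself. It assumes a single joint component, omits the orthogonal (dataset-specific) subspaces of O2PLS entirely, and imposes a group-lasso style soft threshold on the loadings of one dataset only. The function names, the `lam` penalty parameter, and the `sqrt(group size)` weighting are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def group_soft_threshold(w, groups, lam):
    """Shrink each group's sub-vector of w; zero out groups with a small norm.

    Assumption: the threshold for a group is lam * sqrt(group size),
    a common group-lasso weighting (hypothetical choice for this sketch).
    """
    w = np.asarray(w, dtype=float)
    groups = np.asarray(groups)
    out = np.zeros_like(w)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(w[idx])
        thresh = lam * np.sqrt(idx.sum())
        if norm > thresh:
            # Keep the group, shrinking all of its entries by a common factor.
            out[idx] = (1.0 - thresh / norm) * w[idx]
    return out


def group_sparse_joint_loadings(X, Y, groups_x, lam, n_iter=100, tol=1e-8):
    """Alternating updates on X^T Y with group thresholding of the X loadings.

    Returns (w, v): one loading vector for X (group sparse) and one for Y.
    This is a simplified PLS-like iteration, assumed for illustration only.
    """
    C = X.T @ Y
    # Initialise v with the leading right singular vector of X^T Y.
    _, _, vt = np.linalg.svd(C, full_matrices=False)
    v = vt[0]
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w_new = group_soft_threshold(C @ v, groups_x, lam)
        nrm = np.linalg.norm(w_new)
        if nrm == 0.0:
            # Penalty too strong: every group was removed.
            return w_new, v
        w_new = w_new / nrm
        v = C.T @ w_new
        v = v / np.linalg.norm(v)
        converged = np.linalg.norm(w_new - w) < tol
        w = w_new
        if converged:
            break
    return w, v
```

In this sketch, variables enter or leave the component as whole groups: a group whose combined contribution to the covariance is weak is dropped entirely, while a retained group keeps all of its members. With `lam = 0` no group is removed and the iteration reduces to an ordinary power iteration on the cross-product matrix.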
The simulation study showed that introducing sparsity improved feature selection performance. Furthermore, incorporating group structures increased the robustness of the feature selection procedure. GO2PLS performed best in terms of accuracy of joint score estimation, joint loading estimation, and feature selection. We applied GO2PLS to datasets from two studies: TwinsUK (a population study) and CVON-DOSIS (a small case-control study). In the first, we incorporated biological information on the group structures of the methylation CpG sites when integrating the methylation dataset with the IgG glycomics data. The target genes of the selected methylation groups turned out to be relevant to the immune system, in which IgG glycans play important roles. In the second, we selected regulatory regions and transcripts that explained the covariance between the regulomics and transcriptomics data. The genes corresponding to the selected features appeared to be relevant to heart muscle disease.
GO2PLS integrates two omics datasets to help understand the underlying system that involves both omics levels. It incorporates external group information and performs group selection, resulting in a small subset of features that best explain the relationship between two omics datasets for better interpretability.