Deng Linsui, Tang Yanlin, Zhang Xianyang, Chen Jun
School of Data Science, The Chinese University of Hong Kong, Shenzhen, China.
School of Statistics, East China Normal University, Shanghai, China.
Front Genet. 2024 Nov 20;15:1489694. doi: 10.3389/fgene.2024.1489694. eCollection 2024.
Sparse canonical correlation analysis (sCCA) is a useful approach for integrating different high-dimensional datasets by finding a subset of correlated features that explains the most correlation in the data. In microbiome studies, investigators are often interested in how the microbiome interacts with the host at different molecular levels, such as the genome, methylome, transcriptome, metabolome and proteome. sCCA provides a simple way to exploit the correlation structure among multiple omics datasets and to find a set of correlated omics features, which could contribute to understanding the host-microbiome interaction. However, existing sCCA methods do not address compositionality, so their application to microbiome data is not optimal. This paper proposes a new sCCA framework for integrating microbiome data with other high-dimensional omics data, accounting for the compositional nature of microbiome sequencing data. It also allows prior structure information, such as the grouping structure among bacterial taxa, to be integrated by imposing a "soft" constraint on the coefficients through varying penalization strength. As a result, the method provides significant improvement when the structure is informative while remaining robust to a misspecified structure. Through extensive simulation studies and real data analysis, we demonstrate the superiority of the proposed framework over state-of-the-art approaches.
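The two ideas named in the abstract, compositionality and a "soft" constraint through varying penalization strength, can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes a centered log-ratio (CLR) transform as one common way to handle compositional counts, and a per-coefficient weighted soft-thresholding operator as one common way to vary the penalty across features. The function names `clr` and `weighted_soft_threshold`, the `pseudocount` parameter, and the example weights are all hypothetical.

```python
import numpy as np

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of a samples x taxa count matrix.

    Assumption: a pseudocount is added to avoid log(0); the paper may
    handle zeros and compositionality differently.
    """
    logx = np.log(np.asarray(counts, dtype=float) + pseudocount)
    # Subtract each sample's mean log-abundance so every row sums to zero.
    return logx - logx.mean(axis=1, keepdims=True)

def weighted_soft_threshold(v, lam, weights):
    """Soft-threshold each coefficient v[j] with its own penalty lam * weights[j].

    Smaller weights penalize a coefficient less, so features favored by
    prior structure are more likely to stay nonzero (a "soft" constraint).
    """
    v = np.asarray(v, dtype=float)
    thresh = lam * np.asarray(weights, dtype=float)
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

# Illustrative use with made-up numbers.
counts = np.array([[10, 0, 5], [3, 7, 1]])
z = clr(counts)
shrunk = weighted_soft_threshold([3.0, -3.0, 0.5], lam=1.0, weights=[0.5, 2.0, 1.0])
```

In this sketch, a weight below 1 for a taxon in an informative group lowers its threshold, while a misspecified group only changes the amount of shrinkage rather than forcing coefficients in or out, which is the sense in which such a constraint is "soft".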