Group sparse canonical correlation analysis for genomic data integration.

Affiliation

Biomedical Engineering Department, Tulane University, New Orleans, LA, USA.

Publication information

BMC Bioinformatics. 2013 Aug 12;14:245. doi: 10.1186/1471-2105-14-245.

Abstract

BACKGROUND

The emergence of high-throughput genomic datasets from different sources and platforms (e.g., gene expression, single nucleotide polymorphisms (SNPs), and copy number variation (CNV)) has greatly enhanced our understanding of the interplay of these genomic factors and of their influence on complex diseases. Exploring the relationship between these different types of genomic datasets remains challenging. In this paper, we focus on a multivariate statistical method, canonical correlation analysis (CCA), for this problem. Conventional CCA does not work effectively when the number of samples is much smaller than the number of biomarkers, which is the typical case for genomic data (e.g., SNPs). Sparse CCA (sCCA) methods were introduced to overcome this difficulty, mostly using penalties based on the l-1 norm (CCA-l1) or on a combination of the l-1 and l-2 norms (CCA-elastic net). However, they overlook the structural or group effects within genomic data, which often exist and are important (e.g., SNPs spanning a gene interact and work together as a group).
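The small-sample failure of conventional CCA can be illustrated with a minimal sketch. The function name `canonical_correlations`, the matrix sizes, and the random data below are assumptions made for illustration, not part of the paper. The sketch computes unpenalized canonical correlations as the singular values of the product of orthonormal bases of the two centered data matrices. When each block has more features than there are samples, independent noise still yields canonical correlations of essentially 1, which is why some form of regularization is needed.

```python
import numpy as np

def canonical_correlations(X, Y, tol=1e-10):
    """Unpenalized canonical correlations between the columns of X and Y.

    Hypothetical helper for illustration: centers each matrix, builds an
    orthonormal basis of its column space, and returns the singular
    values of the product of the two bases.
    """
    def orthonormal_basis(A):
        A = A - A.mean(axis=0)                      # center each feature
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        return U[:, s > tol * s.max()]              # drop numerically null directions

    Qx = orthonormal_basis(X)
    Qy = orthonormal_basis(Y)
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)

rng = np.random.default_rng(0)

# Fewer samples (20) than features (50 and 40): pure noise looks perfectly correlated.
few_samples = canonical_correlations(rng.standard_normal((20, 50)),
                                     rng.standard_normal((20, 40)))

# Many samples (500) and few features (5 and 4): noise shows only weak correlation.
many_samples = canonical_correlations(rng.standard_normal((500, 5)),
                                      rng.standard_normal((500, 4)))

print("largest canonical correlation, n < p:", few_samples[0])
print("largest canonical correlation, n >> p:", many_samples[0])
```

In the first case the centered column spaces of both blocks fill the same subspace, so the fit is perfect regardless of any true association between the two data types.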

RESULTS

We propose a new group sparse CCA method (CCA-sparse group), along with an effective numerical algorithm, to study the mutual relationship between two different types of genomic data (i.e., SNP and gene expression). We then extend the model to a more general formulation that includes the existing sCCA models. We apply the model to feature/variable selection from two datasets and compare our group sparse CCA method with existing sCCA methods on both simulated data and two real datasets (human glioma data and NCI60 data). We use a graphical representation of the samples with a pair of canonical variates to demonstrate the discriminating characteristics of the selected features. Pathway analysis is further performed for biological interpretation of those features.

CONCLUSIONS

The CCA-sparse group method incorporates group effects of features into the correlation analysis while simultaneously performing individual feature selection. On the simulated data, it outperforms the two sCCA methods (CCA-l1 and CCA-group) by identifying more true positives among the correlated features while keeping total discordance at a lower level, even when no group effect exists or when irrelevant features are grouped with truly correlated ones. Compared with our proposed CCA-sparse group model, CCA-l1 tends to select fewer truly correlated features, while CCA-group tends to select more redundant features.
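How a single penalty can act at both the group and the individual level can be sketched with the standard proximal (thresholding) step for a combined l-1 and group l-2 penalty over non-overlapping groups. This is a generic illustration, not the numerical algorithm of the paper: the function name `sparse_group_threshold`, the parameters `lam_l1` and `lam_group`, the absence of group-size weights, and the toy vector are all assumptions. Setting `lam_group` to zero leaves a pure l-1 step, and setting `lam_l1` to zero leaves a pure group step.

```python
import numpy as np

def sparse_group_threshold(v, groups, lam_l1, lam_group):
    """Thresholding step for an l-1 plus group l-2 penalty (illustrative).

    v         : 1-D array of loadings to shrink
    groups    : 1-D array of group labels, one per entry of v (no overlap)
    lam_l1    : strength of the individual-feature (l-1) penalty
    lam_group : strength of the group (l-2) penalty, unweighted here
    """
    v = np.asarray(v, dtype=float)
    groups = np.asarray(groups)

    # Step 1: elementwise soft-thresholding zeroes weak individual features.
    shrunk = np.sign(v) * np.maximum(np.abs(v) - lam_l1, 0.0)

    # Step 2: group-wise shrinkage zeroes whole groups with a small norm.
    out = np.zeros_like(shrunk)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(shrunk[idx])
        if norm > lam_group:
            out[idx] = (1.0 - lam_group / norm) * shrunk[idx]
    return out

# Toy loadings: group 0 holds two strong features and one weak one,
# group 1 holds only weak features.
v = [3.0, 0.2, -2.0, 0.3, -0.1, 0.2]
groups = [0, 0, 0, 1, 1, 1]

both = sparse_group_threshold(v, groups, lam_l1=0.5, lam_group=1.0)
l1_only = sparse_group_threshold(v, groups, lam_l1=0.5, lam_group=0.0)
group_only = sparse_group_threshold(v, groups, lam_l1=0.0, lam_group=1.0)

print("sparse group:", both)
print("l-1 only:    ", l1_only)
print("group only:  ", group_only)
```

In this toy case the group-only step keeps the weak feature inside the strong group, while the combined step removes it, which mirrors the tendency of a pure group penalty to retain redundant features.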

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f95e/3751310/f067adf7e756/1471-2105-14-245-1.jpg
