Sparse canonical methods for biological data integration: application to a cross-platform study.

Authors

Lê Cao Kim-Anh, Martin Pascal G P, Robert-Granié Christèle, Besse Philippe

Affiliation

Station d'Amélioration Génétique des Animaux UR 631, Institut National de Recherche Agronomique, F-31326 Castanet, France.

Publication

BMC Bioinformatics. 2009 Jan 26;10:34. doi: 10.1186/1471-2105-10-34.

Abstract

BACKGROUND

In the context of systems biology, few sparse approaches have been proposed so far to integrate several data sets. This is, however, an important and fundamental issue that will be widely encountered in post-genomic studies, where transcriptomics, proteomics and metabolomics data from different platforms are analyzed simultaneously to understand the mutual interactions between the data sets. In this high-dimensional setting, variable selection is crucial for obtaining interpretable results. We focus on a sparse Partial Least Squares approach (sPLS) to handle two-block data sets, where the relationship between the two types of variables is known to be symmetric. Sparse PLS has been developed for both a regression and a canonical correlation framework and includes a built-in procedure to select variables while integrating data. To illustrate the canonical mode approach, we analyzed the NCI60 data sets, where two different platforms (cDNA and Affymetrix chips) were used to study the transcriptome of sixty cancer cell lines.

RESULTS

We compare the sPLS results with those obtained from two other sparse or related canonical correlation approaches: CCA with Elastic Net penalization (CCA-EN) and Co-Inertia Analysis (CIA). The latter does not include a built-in procedure for variable selection and requires a two-step analysis. We stress the lack of statistical criteria to evaluate canonical correlation methods, which makes biological interpretation absolutely necessary to compare the different gene selections. We also propose comprehensive graphical representations of both samples and variables to facilitate the interpretation of the results.

CONCLUSION

sPLS and CCA-EN selected highly relevant genes and complementary findings from the two data sets, which enabled a detailed understanding of the molecular characteristics of several groups of cell lines. These two approaches were found to give similar results, although they highlighted the same phenomena with different priorities. They outperformed CIA, which tended to select redundant information.
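
The abstract does not spell out the sPLS algorithm, so the following is only a minimal sketch of the general idea of selecting variables while integrating two blocks. It assumes one common formulation: alternating power iterations on the cross-product matrix X^T Y, with soft-thresholding of both loading vectors. All function names and the `keep_x` / `keep_y` parameters are hypothetical, the toy data are made up for illustration, and the deflation step that distinguishes the canonical mode from the regression mode is omitted. It is not the authors' implementation.

```python
import math

def center_columns(rows):
    """Subtract the column mean from every entry."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def cross_product(X, Y):
    """Return M = X^T Y as a list of p rows with q entries each."""
    return [[sum(a * b for a, b in zip(xc, yc)) for yc in zip(*Y)]
            for xc in zip(*X)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def soft_threshold_keep(v, keep):
    """Soft-threshold v so that at most `keep` entries stay non-zero.

    The penalty is set to the (keep + 1)-th largest magnitude (assumption:
    sparsity is expressed as a number of variables to retain).
    """
    magnitudes = sorted((abs(x) for x in v), reverse=True)
    lam = magnitudes[keep] if keep < len(v) else 0.0
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in v]

def sparse_pls_component(X, Y, keep_x, keep_y, n_iter=50):
    """First pair of sparse loading vectors (u for X, v for Y)."""
    M = cross_product(center_columns(X), center_columns(Y))
    Mt = [list(col) for col in zip(*M)]
    v = normalize([1.0] * len(Mt))   # simple deterministic starting point
    u = [0.0] * len(M)
    for _ in range(n_iter):
        u = normalize(soft_threshold_keep(matvec(M, v), keep_x))
        v = normalize(soft_threshold_keep(matvec(Mt, u), keep_y))
    return u, v

# Toy data (made up): 6 samples, 4 X-variables, 3 Y-variables.
# Only the first variable of each block carries the shared signal.
X = [[-3.0,  0.1,  0.2,  0.0],
     [-2.0, -0.1,  0.0,  0.1],
     [-1.0,  0.1, -0.2,  0.0],
     [ 1.0, -0.1,  0.1, -0.1],
     [ 2.0,  0.1, -0.1,  0.0],
     [ 3.0, -0.1,  0.0,  0.0]]
Y = [[-6.0,  0.1, -0.1],
     [-4.0,  0.0,  0.1],
     [-2.0, -0.1,  0.0],
     [ 2.0,  0.0,  0.1],
     [ 4.0,  0.1, -0.1],
     [ 6.0, -0.1,  0.0]]

u, v = sparse_pls_component(X, Y, keep_x=1, keep_y=1)
print("X loadings:", u)
print("Y loadings:", v)
```

With `keep_x=1` and `keep_y=1`, the loadings in this toy example should concentrate on the first variable of each block, which is the sense in which selection and integration happen in a single step. In practice, the number of retained variables per block would have to be tuned, and further components would require deflating X and Y.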
