Concordant integrative gene set enrichment analysis of multiple large-scale two-sample expression data sets.

Publication information

BMC Genomics. 2014;15 Suppl 1(Suppl 1):S6. doi: 10.1186/1471-2164-15-S1-S6. Epub 2014 Jan 24.

Abstract

BACKGROUND

Gene set enrichment analysis (GSEA) is an important approach to the analysis of coordinate expression changes at a pathway level. Although many statistical and computational methods have been proposed for GSEA, the issue of a concordant integrative GSEA of multiple expression data sets has not been well addressed. Among different related data sets collected for the same or similar study purposes, it is important to identify pathways or gene sets with concordant enrichment.

METHODS

We categorize the underlying true states of differential expression into three representative categories: no change, positive change and negative change. Due to data noise, what we observe from experiments may not indicate the underlying truth. Although these categories are not observed in practice, they can be considered in a mixture model framework. Then, we define the mathematical concept of concordant gene set enrichment and calculate its related probability based on a three-component multivariate normal mixture model. The related false discovery rate can be calculated and used to rank different gene sets.
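The final step, turning per-gene-set probabilities into a false discovery rate used for ranking, can be sketched as follows. This is a minimal illustration and not necessarily the authors' exact procedure: it assumes the mixture model has already produced, for each gene set, a posterior probability of concordant enrichment, and it uses the common posterior-probability estimate in which the FDR of a top-k list is the average of (1 - probability) over that list. The function name `posterior_fdr` and the example probabilities are hypothetical.

```python
import numpy as np

def posterior_fdr(prob_enriched):
    """Rank gene sets by posterior probability of concordant enrichment
    and estimate the FDR of each top-k list.

    prob_enriched: one posterior probability per gene set (assumed to
    come from an already-fitted mixture model; not computed here).
    Returns (order, fdr): indices from most to least probable, and the
    estimated FDR when the top 1, 2, ..., n gene sets are reported.
    """
    p = np.asarray(prob_enriched, dtype=float)
    order = np.argsort(-p, kind="stable")   # most probable gene set first
    local_error = 1.0 - p[order]            # probability that each call is false
    # Average the error probabilities over each top-k list.
    fdr = np.cumsum(local_error) / np.arange(1, p.size + 1)
    return order, fdr

# Hypothetical probabilities for four gene sets.
probs = [0.99, 0.40, 0.95, 0.80]
order, fdr = posterior_fdr(probs)
for rank, (idx, q) in enumerate(zip(order, fdr), start=1):
    print(f"rank {rank}: gene set {idx}, estimated FDR {q:.3f}")
```

Because the gene sets are sorted by decreasing probability, the estimated FDR never decreases as the list grows, so a cutoff such as 0.05 selects a single prefix of the ranking.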

RESULTS

We used three published lung cancer microarray gene expression data sets to illustrate our proposed method. We first analyzed the first two data sets and compared our result with a previously published result in which GSEA was conducted separately for each data set. This comparison illustrates the advantage of our proposed concordant integrative gene set enrichment analysis. Then, with a relatively new and larger pathway collection, we used our method to conduct an integrative analysis of the first two data sets and of all three data sets. Both analyses identified many gene sets with low false discovery rates, and the two sets of results were consistent with each other. A further exploration based on the KEGG cancer pathway collection showed that a majority of these pathways could be identified by our proposed method.

CONCLUSIONS

This study illustrates that we can improve detection power and discovery consistency through a concordant integrative analysis of multiple large-scale two-sample gene expression data sets.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6951/4046697/6f507c66a750/12864_2014_5679_Fig1_HTML.jpg
