An efficient concordant integrative analysis of multiple large-scale two-sample expression data sets.

Author information

Department of Statistics, The George Washington University, Washington, DC 20052, USA.

Department of Pharmacology and Physiology.

Publication information

Bioinformatics. 2017 Dec 1;33(23):3852-3860. doi: 10.1093/bioinformatics/btx061.

Abstract

MOTIVATION

We have proposed a mixture model based approach to the concordant integrative analysis of multiple large-scale two-sample expression datasets. Since the mixture model is based on the transformed differential expression test P-values (z-scores), it is generally applicable to the expression data generated by either microarray or RNA-seq platforms. The mixture model is simple with three normal distribution components for each dataset to represent down-regulation, up-regulation and no differential expression. However, when the number of datasets increases, the model parameter space increases exponentially due to the component combination from different datasets.
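The exponential growth described above can be made concrete with a minimal Python sketch. It is not the authors' R code. It assumes an upper-tailed P-value is mapped to a z-score by z = Φ⁻¹(1 − p), which is one common convention and may differ from the transform used in the paper. The names `p_to_z`, `joint_components`, `is_concordant` and the state labels are hypothetical.

```python
from itertools import product
from statistics import NormalDist

# Three per-dataset states, as in the abstract; the labels are illustrative.
STATES = ("down", "null", "up")


def p_to_z(p_upper, eps=1e-12):
    """Assumed transform: z = Phi^{-1}(1 - p) for an upper-tailed P-value."""
    p = min(max(p_upper, eps), 1.0 - eps)  # keep inv_cdf inside (0, 1)
    return NormalDist().inv_cdf(1.0 - p)


def joint_components(n_datasets):
    """Enumerate every combination of states across the datasets."""
    return list(product(STATES, repeat=n_datasets))


def is_concordant(combo):
    """A combination is concordant when all datasets share one state."""
    return len(set(combo)) == 1


if __name__ == "__main__":
    for k in (1, 2, 3, 5):
        combos = joint_components(k)
        n_concordant = sum(is_concordant(c) for c in combos)
        # Columns: datasets, joint components (3**k), concordant components
        print(k, len(combos), n_concordant)
```

With K datasets there are 3^K joint components but only three concordant ones (all down, all null, all up), which is why a model focused on concordance leaves room for reduction.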

RESULTS

In this study, motivated by the well-known generalized estimating equations (GEEs) for longitudinal data analysis, we focus on the concordant components and assume that the proportions of non-concordant components follow a special structure. We discuss the exchangeable, multiset coefficient and autoregressive structures for model reduction, and their related expectation-maximization (EM) algorithms. Then, the parameter space is linear with the number of datasets. In our previous study, we have applied the general mixture model to three microarray datasets for lung cancer studies. We show that more gene sets (or pathways) can be detected by the reduced mixture model with the exchangeable structure. Furthermore, we show that more genes can also be detected by the reduced model. The Cancer Genome Atlas (TCGA) data have been increasingly collected. The advantage of incorporating the concordance feature has also been clearly demonstrated based on TCGA RNA sequencing data for studying two closely related types of cancer.
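To illustrate how a structural assumption can bring the parameter count from exponential to linear in the number of datasets, the following Python sketch compares two counts. It rests on assumptions that are not taken from the paper: each of the three normal components in each dataset has a free mean and variance, and the reduction is a simple stand-in in which all non-concordant combinations share a single common proportion. The paper's exchangeable, multiset coefficient and autoregressive structures may be defined differently. The function names are hypothetical.

```python
def general_param_count(k, per_component=2):
    """Free parameters of the unrestricted model with k datasets.

    Mixing proportions: one per joint component, minus the sum-to-one
    constraint. Component parameters: 3 normals per dataset, each with
    `per_component` parameters (assumed: mean and variance).
    """
    mixing = 3 ** k - 1
    return mixing + 3 * per_component * k


def reduced_param_count(k, per_component=2):
    """Free parameters under the assumed shared-proportion reduction.

    Three concordant proportions plus one proportion shared by every
    non-concordant combination, minus the sum-to-one constraint.
    """
    mixing = 3 + 1 - 1
    return mixing + 3 * per_component * k


if __name__ == "__main__":
    for k in (2, 3, 5, 10):
        # Columns: datasets, general count, reduced count
        print(k, general_param_count(k), reduced_param_count(k))
```

Under these assumptions the general count is dominated by the 3^K − 1 mixing proportions, while the reduced count grows only through the 6K per-dataset component parameters.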

AVAILABILITY AND IMPLEMENTATION

Additional results are included in a supplemental file. Computer program R-functions are freely available at http://home.gwu.edu/~ylai/research/Concordance.

CONTACT

ylai@gwu.edu.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
