Department of Statistics, The George Washington University, Washington, DC 20052, USA.
Department of Pharmacology and Physiology.
Bioinformatics. 2017 Dec 1;33(23):3852-3860. doi: 10.1093/bioinformatics/btx061.
We previously proposed a mixture-model-based approach to the concordant integrative analysis of multiple large-scale two-sample expression datasets. Because the mixture model is built on z-scores obtained by transforming differential expression test P-values, it is generally applicable to expression data generated by either microarray or RNA-seq platforms. The model is simple: each dataset has three normal components representing down-regulation, up-regulation and no differential expression. However, as the number of datasets increases, the model parameter space grows exponentially, because the components from different datasets are combined.
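The two ideas in this paragraph, the P-value to z-score transform and the exponential growth in component combinations, can be illustrated with a minimal sketch. It assumes a one-sided P-value mapped through the standard normal quantile function, which may differ from the exact transform used in the paper. The function names `p_to_z` and `n_component_combinations` are hypothetical and are not part of the authors' R functions.

```python
from statistics import NormalDist


def p_to_z(p_one_sided):
    """Map a one-sided P-value to a z-score (assumed transform).

    A small P-value gives a large positive z, a P-value near 1 gives a
    large negative z, and p = 0.5 maps to z = 0.
    """
    return NormalDist().inv_cdf(1.0 - p_one_sided)


def n_component_combinations(n_datasets, n_states=3):
    """Count joint component configurations across datasets.

    Each dataset has n_states components (down, null, up), so the
    unreduced joint model has n_states ** n_datasets combinations.
    """
    return n_states ** n_datasets


if __name__ == "__main__":
    for k in (2, 3, 5, 10):
        print(k, "datasets:", n_component_combinations(k), "combinations")
```

With three states per dataset, the count of joint configurations triples with each added dataset, which is the growth the reduced structures are meant to avoid.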
In this study, motivated by the well-known generalized estimating equations (GEEs) for longitudinal data analysis, we focus on the concordant components and assume that the proportions of non-concordant components follow a special structure. We discuss the exchangeable, multiset coefficient and autoregressive structures for model reduction, together with their related expectation-maximization (EM) algorithms. Under these structures, the parameter space grows linearly with the number of datasets. In our previous study, we applied the general mixture model to three microarray datasets for lung cancer studies. Here we show that the reduced mixture model with the exchangeable structure detects more gene sets (or pathways), and that it also detects more genes. Data from The Cancer Genome Atlas (TCGA) are increasingly available. The advantage of incorporating the concordance feature is also clearly demonstrated on TCGA RNA sequencing data for two closely related types of cancer.
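As a generic illustration of the EM idea mentioned above, the sketch below fits a three-component normal mixture (down, null, up) to z-scores from a single dataset. It is not the reduced multi-dataset algorithm from the paper: the exchangeable, multiset coefficient and autoregressive structures are not implemented here, and the function name `em_three_normals`, the starting values and the simulated data are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def em_three_normals(z, n_iter=100):
    """Generic EM for a 1-D mixture of three normals (down, null, up).

    Returns (proportions, means, standard deviations). Starting values
    are illustrative guesses, not values taken from the paper.
    """
    props = [0.1, 0.8, 0.1]
    means = [-2.0, 0.0, 2.0]
    sds = [1.0, 1.0, 1.0]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each z-score.
        resp = []
        for x in z:
            w = [props[j] * normal_pdf(x, means[j], sds[j]) for j in range(3)]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M-step: weighted updates of proportions, means and sds.
        for j in range(3):
            nj = sum(r[j] for r in resp)
            props[j] = nj / len(z)
            means[j] = sum(r[j] * x for r, x in zip(resp, z)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, z)) / nj
            sds[j] = max(math.sqrt(var), 1e-3)  # guard against collapse
    return props, means, sds


# Simulated z-scores (assumed setup): 10% down, 80% null, 10% up.
rng = random.Random(0)
z_scores = (
    [rng.gauss(-3.0, 1.0) for _ in range(150)]
    + [rng.gauss(0.0, 1.0) for _ in range(1200)]
    + [rng.gauss(3.0, 1.0) for _ in range(150)]
)
props, means, sds = em_three_normals(z_scores)

if __name__ == "__main__":
    print("proportions:", [round(p, 3) for p in props])
    print("means:", [round(m, 3) for m in means])
    print("sds:", [round(s, 3) for s in sds])
```

In the multi-dataset setting described above, the E-step would instead work with joint configurations across datasets, and the structural assumptions would constrain how the non-concordant proportions are updated in the M-step.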
Additional results are included in a supplemental file. R functions are freely available at http://home.gwu.edu/~ylai/research/Concordance.
Supplementary data are available at Bioinformatics online.