An efficient concordant integrative analysis of multiple large-scale two-sample expression data sets.

Affiliations

Department of Statistics, The George Washington University, Washington, DC 20052, USA.

Department of Pharmacology and Physiology.

Publication information

Bioinformatics. 2017 Dec 1;33(23):3852-3860. doi: 10.1093/bioinformatics/btx061.

DOI:10.1093/bioinformatics/btx061

PMID:28174897

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5860313/

Abstract

MOTIVATION

We previously proposed a mixture-model-based approach to the concordant integrative analysis of multiple large-scale two-sample expression datasets. Because the mixture model is fitted to z-scores obtained by transforming differential expression test P-values, it is generally applicable to expression data generated by either microarray or RNA-seq platforms. The model is simple: for each dataset, three normal distribution components represent down-regulation, up-regulation and no differential expression. However, as the number of datasets increases, the model parameter space grows exponentially because components from different datasets must be combined.
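As a rough illustration of the setup described above, the following Python sketch converts a two-sided P-value and an effect direction into a signed z-score, and counts the joint component combinations that arise when each of K datasets has three states. The conversion convention (a two-sided P-value with the sign taken from the direction of the effect) and the function names `p_to_z` and `n_joint_components` are assumptions made for illustration; the authors' R functions may use a different transformation.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def p_to_z(p_two_sided, direction):
    """Map a two-sided P-value and effect direction (+1 or -1) to a signed z-score.

    Assumed convention, not taken from the paper: |z| is the standard normal
    quantile whose two-sided tail probability equals the P-value.
    """
    if not 0.0 < p_two_sided <= 1.0:
        raise ValueError("P-value must lie in (0, 1]")
    if direction not in (-1, 1):
        raise ValueError("direction must be +1 or -1")
    # inv_cdf(p / 2) is the (negative) lower-tail quantile, so negate it.
    magnitude = -_STD_NORMAL.inv_cdf(p_two_sided / 2.0)
    return direction * magnitude


def n_joint_components(n_datasets, n_states=3):
    """Number of joint component combinations across datasets (n_states ** K)."""
    return n_states ** n_datasets
```

With three states per dataset, the number of joint combinations is 3^K, which is the exponential growth the abstract refers to: 27 combinations for three datasets, and 59049 for ten.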

RESULTS

In this study, motivated by the well-known generalized estimating equations (GEEs) for longitudinal data analysis, we focus on the concordant components and assume that the proportions of non-concordant components follow a special structure. We discuss the exchangeable, multiset coefficient and autoregressive structures for model reduction, together with their related expectation-maximization (EM) algorithms. Under these structures, the size of the parameter space grows linearly with the number of datasets. In our previous study, we applied the general mixture model to three microarray datasets from lung cancer studies. Here we show that the reduced mixture model with the exchangeable structure detects more gene sets (or pathways), and that the reduced model also detects more genes. As data from The Cancer Genome Atlas (TCGA) continue to accumulate, we also use TCGA RNA sequencing data on two closely related types of cancer to demonstrate the advantage of incorporating the concordance feature.
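The abstract does not spell out the EM algorithms for the reduced structures, so the sketch below illustrates only the basic building block: EM for a single dataset with three normal components (down-regulated, null, up-regulated). It is not the authors' multi-dataset method. Fixing the null component at N(0, 1), the starting values, the simulated proportions and the function names are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    u = (x - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))


def simulate_z(n, seed=0):
    """Simulate z-scores: 10% down (mean -3), 10% up (mean 3), 80% null (assumed values)."""
    rng = random.Random(seed)
    z = []
    for _ in range(n):
        u = rng.random()
        if u < 0.1:
            z.append(rng.gauss(-3.0, 1.0))
        elif u < 0.2:
            z.append(rng.gauss(3.0, 1.0))
        else:
            z.append(rng.gauss(0.0, 1.0))
    return z


def em_three_component(z, n_iter=100):
    """EM for a three-component normal mixture; index 0 = down, 1 = null, 2 = up.

    The null component is held fixed at N(0, 1) (an assumption of this sketch);
    only its mixing proportion is updated.
    """
    pi = [0.1, 0.8, 0.1]      # assumed starting proportions
    mu = [-2.0, 0.0, 2.0]     # assumed starting means
    sigma = [1.0, 1.0, 1.0]   # assumed starting standard deviations
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each z-score.
        resp = []
        for x in z:
            w = [pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(3)]
            total = sum(w) or 1e-300  # guard against numerical underflow
            resp.append([wk / total for wk in w])
        # M-step: update proportions, then means and sds of the non-null components.
        for k in range(3):
            nk = max(sum(r[k] for r in resp), 1e-12)
            pi[k] = nk / len(z)
            if k == 1:
                continue  # null component stays at N(0, 1)
            mu[k] = sum(r[k] * x for r, x in zip(resp, z)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, z)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)  # avoid a collapsing component
    return pi, mu, sigma
```

A call such as `em_three_component(simulate_z(1500, seed=1))` returns the estimated proportions, means and standard deviations. In the multi-dataset setting, the proportions would instead range over joint component combinations, which is where the structural assumptions in the abstract reduce the number of parameters.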

AVAILABILITY AND IMPLEMENTATION

Additional results are included in a supplemental file. R functions implementing the method are freely available at http://home.gwu.edu/~ylai/research/Concordance.

CONTACT

ylai@gwu.edu.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
