Partition decoupling for multi-gene analysis of gene expression profiling data.

Affiliation

Department of Preventive Medicine and Robert H. Lurie Cancer Center, Northwestern University, Chicago, IL, USA.

Publication information

BMC Bioinformatics. 2011 Dec 30;12:497. doi: 10.1186/1471-2105-12-497.

DOI:10.1186/1471-2105-12-497

PMID:22208906

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3276603/

Abstract

BACKGROUND

Multi-gene interactions likely play an important role in the development of complex phenotypes, and relationships between interacting genes pose a challenging statistical problem in microarray analysis, since the genes involved in these interactions may not exhibit marginal differential expression. As a result, it is necessary to develop tools that can identify sets of interacting genes that discriminate phenotypes without requiring that the classification boundary between phenotypes be convex.
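As a small illustration of why interacting genes can escape single-gene tests, the sketch below builds a hypothetical two-gene data set (invented for this example, not taken from the paper) in which neither gene differs in mean expression between phenotypes, yet the two genes jointly separate them. Phenotype 0 occupies two opposite corners of the expression plane, so the region it covers is not convex and no single straight line separates the classes. The function name `marginal_mean_difference` and the product feature are assumptions chosen for illustration.

```python
import numpy as np

def marginal_mean_difference(expr, labels):
    """Per-gene difference in mean expression: phenotype 1 minus phenotype 0."""
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    return expr[labels == 1].mean(axis=0) - expr[labels == 0].mean(axis=0)

# Hypothetical toy data: rows are samples, columns are two genes.
# Phenotype 0: both genes low or both genes high.
# Phenotype 1: exactly one of the two genes high.
expr = np.array([
    [0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0],  # phenotype 0
    [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0],  # phenotype 1
])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Each gene taken alone shows no difference in mean between phenotypes.
diff = marginal_mean_difference(expr, labels)

# A joint feature (product of centered values) is positive when the two
# genes agree and negative when they disagree.
centered = expr - expr.mean(axis=0)
joint = centered[:, 0] * centered[:, 1]
```

Here `diff` is zero for both genes, while the sign of `joint` matches the phenotype for every sample, which is the kind of multi-gene pattern that a test on individual genes would not detect.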

RESULTS

We describe an extension and application of a new unsupervised statistical learning technique, known as the Partition Decoupling Method (PDM), to gene expression microarray data. This method may be used to classify samples based on multi-gene expression patterns and to identify pathways associated with phenotype, without relying upon the differential expression of individual genes. The PDM uses iterated spectral clustering and scrubbing steps, revealing at each iteration progressively finer structure in the geometry of the data. Because spectral clustering has the ability to discern clusters that are not linearly separable, it is able to articulate relationships between samples that would be missed by distance- and tree-based classifiers. After projecting the data onto the cluster centroids and computing the residuals ("scrubbing"), one can repeat the spectral clustering, revealing clusters that were not discernible in the first layer. These iterations, each of which provides a partition of the data that is decoupled from the others, are carried forward until the structure in the residuals is indistinguishable from noise, preventing over-fitting. We describe the PDM in detail and apply it to three publicly available cancer gene expression data sets. By applying the PDM on a pathway-by-pathway basis and identifying those pathways that permit unsupervised clustering of samples that match known sample characteristics, we show how the PDM may be used to find sets of mechanistically related genes that may play a role in disease. An R package to carry out the PDM is available for download.
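The cluster-then-scrub iteration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' R package or its API: the function names `spectral_bipartition` and `scrub` are hypothetical, the affinity is assumed to be a Gaussian kernel with a hand-picked `sigma`, each layer is limited to a two-way split by the sign of the Fiedler vector (swapped in for a full spectral clustering with a data-driven number of clusters), and the noise-based stopping test is omitted. The toy data are invented so that a strong partition hides a weaker, independent one.

```python
import numpy as np

def spectral_bipartition(X, sigma):
    """Split samples (rows of X) in two by the sign of the Fiedler vector."""
    X = np.asarray(X, dtype=float)
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))  # Gaussian affinity (assumed kernel)
    np.fill_diagonal(W, 0.0)                   # no self-affinity
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Symmetric normalized graph Laplacian: I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)                # eigenvalues in ascending order
    return (vecs[:, 1] > 0).astype(int)        # second-smallest eigenvector

def scrub(X, labels):
    """Project each sample onto the span of the cluster centroids and
    return the residuals (the "scrubbing" step)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in np.unique(labels)])
    # Least-squares coefficients of every sample on the centroid vectors.
    coef, *_ = np.linalg.lstsq(centroids.T, X.T, rcond=None)
    return X - (centroids.T @ coef).T

# Hypothetical toy data: 8 samples, 4 genes, two independent partitions.
a = np.array([1, 1, 1, 1, -1, -1, -1, -1])  # strong partition (genes 0 and 1)
b = np.array([1, -1, 1, -1, 1, -1, 1, -1])  # weaker partition (genes 2 and 3)
X = np.column_stack([3 * a, 3 * a, b, b]).astype(float)

layer1 = spectral_bipartition(X, sigma=5.0)          # finds the strong split
residuals = scrub(X, layer1)                         # removes its contribution
layer2 = spectral_bipartition(residuals, sigma=5.0)  # reveals the weaker split
```

In this noise-free example the first layer groups samples by `a`, and only after scrubbing does the second layer group them by `b`. A real analysis would also need a principled choice of `sigma` and a test of whether the remaining structure differs from noise before accepting a further layer.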

CONCLUSIONS

We show that the PDM is a useful tool for the analysis of gene expression data from complex diseases, where phenotypes are not linearly separable and multi-gene effects are likely to play a role. Our results demonstrate that the PDM is able to distinguish cell types and treatments with higher accuracy than is obtained through other approaches, and that the Pathway-PDM application is a valuable technique for identifying disease-associated pathways.
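One way to picture the pathway-level use of the method is to score, for each pathway, how well the unsupervised cluster assignments obtained from that pathway's genes agree with a known sample characteristic. The sketch below assumes the cluster assignments have already been computed by some clustering step and uses a Pearson chi-square statistic on the cluster-by-phenotype contingency table as the agreement score. The choice of statistic, the function name `chi_square_statistic`, and the pathway names and labels are all assumptions made for illustration.

```python
import numpy as np

def chi_square_statistic(clusters, phenotypes):
    """Pearson chi-square statistic of the cluster-by-phenotype table."""
    clusters = np.asarray(clusters)
    phenotypes = np.asarray(phenotypes)
    observed = np.array(
        [[np.sum((clusters == c) & (phenotypes == p))
          for p in np.unique(phenotypes)]
         for c in np.unique(clusters)],
        dtype=float,
    )
    # Expected counts if cluster membership were independent of phenotype.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())

# Hypothetical inputs: known phenotype per sample, and cluster labels
# obtained separately from the genes of two made-up pathways.
phenotypes = np.array([0, 0, 0, 0, 1, 1, 1, 1])
pathway_clusters = {
    "pathway_A": np.array([0, 0, 0, 0, 1, 1, 1, 1]),  # clusters track phenotype
    "pathway_B": np.array([0, 1, 0, 1, 0, 1, 0, 1]),  # clusters unrelated to it
}

scores = {name: chi_square_statistic(c, phenotypes)
          for name, c in pathway_clusters.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

With these inputs `pathway_A` scores 8.0 and `pathway_B` scores 0.0, so `pathway_A` ranks first. In practice the statistic would be converted to a p-value and corrected for the number of pathways tested.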




