Robust complementary hierarchical clustering for gene expression data analysis by β-divergence.

Affiliation

Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan.

Publication information

J Biosci Bioeng. 2013 Sep;116(3):397-407. doi: 10.1016/j.jbiosc.2013.03.010. Epub 2013 Apr 19.

Abstract

Hierarchical clustering (HC) is one of the most widely used unsupervised statistical techniques for analyzing microarray gene expression data. When HC is applied to gene expression data to cluster individuals, most HC algorithms generate clusters based on highly differentially expressed (DE) genes that have very similar expression patterns. These highly DE genes may sometimes be irrelevant to biological processes. The serious problem is that such irrelevant, highly expressed genes can drown out weakly expressed genes that have important biological functions. To overcome this problem, Nowak and Tibshirani proposed complementary hierarchical clustering (CHC) (Biostatistics, 9, 467-483, 2008). However, CHC is not robust against outlying expression values and often produces misleading results when the gene expression data are contaminated. We therefore propose a robust CHC (RCHC) method, which robustifies CHC with respect to outliers by maximizing the β-likelihood function for sequential extraction of a gene set with proper groups of individuals. Note that the proposed method reduces to CHC as the tuning parameter β → 0. The value of β plays a key role in the performance of the RCHC method, because it controls the tradeoff between the robustness and the efficiency of the estimators. In simulations and real gene expression analyses, the RCHC method is robust to data contamination in gene expression clustering, overcomes the problem of CHC, and predicts critically important genes from breast cancer data.
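As a rough illustration of the robustness-efficiency tradeoff that β controls, the sketch below applies a β-weighted fixed-point update to a simple location estimate. It is not the RCHC algorithm from the paper. The function name `beta_weighted_mean`, the Gaussian weight form, the fixed MAD-based scale, and the toy data are all assumptions made for this example. It shows only the general idea: observations far from the current estimate are down-weighted, and the estimate falls back to the ordinary mean when β = 0.

```python
import numpy as np


def beta_weighted_mean(x, beta, n_iter=100, tol=1e-10):
    """Illustrative beta-weighted location estimate (hypothetical helper).

    Each point gets weight exp(-beta/2 * ((x - mu) / sigma)**2), so points
    far from the current centre contribute little. With beta = 0 every
    weight equals 1 and the result is the ordinary sample mean.
    """
    x = np.asarray(x, dtype=float)
    mu = np.median(x)  # robust starting point (an illustrative choice)
    # Scale is held fixed at 1.4826 * MAD to keep the sketch short.
    sigma = 1.4826 * np.median(np.abs(x - mu))
    if sigma == 0.0:
        return mu  # degenerate data: no spread to weight against
    for _ in range(n_iter):
        w = np.exp(-0.5 * beta * ((x - mu) / sigma) ** 2)
        mu_new = np.sum(w * x) / np.sum(w)
        converged = abs(mu_new - mu) < tol
        mu = mu_new
        if converged:
            break
    return mu


# Toy data (assumed for illustration): 95 clean values plus 5 gross outliers.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0.0, 1.0, 95), np.full(5, 20.0)])

plain = beta_weighted_mean(values, beta=0.0)   # same as values.mean()
robust = beta_weighted_mean(values, beta=0.3)  # outliers are down-weighted
print(f"beta=0.0: {plain:.3f}   beta=0.3: {robust:.3f}")
```

Larger values of β discount outlying values more aggressively but also discard more information from clean data, which is the efficiency cost that the abstract refers to.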

