Robust multi-scale clustering of large DNA microarray datasets with the consensus algorithm.

Author information

Grotkjaer Thomas, Winther Ole, Regenberg Birgitte, Nielsen Jens, Hansen Lars Kai

Affiliation

Center for Microbial Biotechnology BioCentrum-DTU, Building 223, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark.

Publication information

Bioinformatics. 2006 Jan 1;22(1):58-67. doi: 10.1093/bioinformatics/bti746. Epub 2005 Oct 27.

Abstract

MOTIVATION

Hierarchical and relocation clustering (e.g. K-means and self-organizing maps) have been successful tools in the display and analysis of whole genome DNA microarray expression data. However, the results of hierarchical clustering are sensitive to outliers, and most relocation methods give results which are dependent on the initialization of the algorithm. Therefore, it is difficult to assess the significance of the results. We have developed a consensus clustering algorithm, where the final result is averaged over multiple clustering runs, giving a robust and reproducible clustering, capable of capturing small signal variations. The algorithm preserves valuable properties of hierarchical clustering, which is useful for visualization and interpretation of the results.

RESULTS

We show for the first time that one can take advantage of multiple clustering runs in DNA microarray analysis by collecting recurring clustering patterns in a co-occurrence matrix. Consensus clustering obtained from repeated runs of Variational Bayes Mixtures of Gaussians or K-means significantly reduces the classification error rate for a simulated dataset. The method is flexible, and consensus clusters can also be found across different clustering algorithms. The algorithm can therefore serve as a framework for testing, in a quantitative manner, the homogeneity of different clustering algorithms. We compare the method with a number of state-of-the-art clustering methods and show that it is robust and gives low classification error rates for a realistic, simulated dataset. The algorithm is also demonstrated on real datasets, where more biologically meaningful transcriptional patterns can be found without conservative statistical or fold-change exclusion of data.
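The co-occurrence idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' ClusterLustre Matlab code: the function names (`kmeans`, `cooccurrence`, `consensus_clusters`), the toy data, the number of runs, and the choice of average linkage on one minus the co-occurrence frequency are all hypothetical choices made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's K-means with random initial centers; one label per point."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cooccurrence(points, k, runs, seed=0):
    """Fraction of runs in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / runs for c in row] for row in counts]


def consensus_clusters(cooc, k):
    """Average-linkage agglomeration on the distance 1 - co-occurrence
    (the linkage choice is an assumption of this sketch)."""
    clusters = [[i] for i in range(len(cooc))]

    def avg_dist(a, b):
        return sum(1.0 - cooc[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        a, b = min(pairs, key=lambda xy: avg_dist(clusters[xy[0]], clusters[xy[1]]))
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    return sorted(sorted(c) for c in clusters)


# Toy data (hypothetical): two well-separated groups of three "genes" each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
cooc = cooccurrence(points, k=2, runs=20)
clusters = consensus_clusters(cooc, k=2)
print(clusters)
```

Because the matrix only records which pairs of items were grouped together, label vectors produced by other clustering algorithms could be accumulated into the same counts, which is one way to read the abstract's remark that consensus clusters can be found across different algorithms.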

AVAILABILITY

Matlab source code for the clustering algorithm ClusterLustre, and the simulated dataset for testing are available upon request from T.G. and O.W.
