LCE: a link-based cluster ensemble method for improved gene expression data analysis.
Author information
Department of Computer Science, Aberystwyth University, Aberystwyth, Ceredigion, UK.
Publication information
Bioinformatics. 2010 Jun 15;26(12):1513-9. doi: 10.1093/bioinformatics/btq226. Epub 2010 May 5.
MOTIVATION
Selecting the most effective clustering method and its parameterization for a particular set of gene expression data is far from trivial, because there are a very large number of possibilities. Although many researchers still prefer to use hierarchical clustering in one form or another, this is often sub-optimal. Cluster ensemble research addresses this problem by automatically combining multiple data partitions from different clusterings to improve both the robustness and the quality of the clustering result. However, many existing ensemble techniques use an association matrix to summarize sample-cluster co-occurrence statistics, so relations within an ensemble are captured only at a coarse level, while those existing among clusters are completely neglected. Discovering these missing associations may greatly extend the capability of the ensemble methodology for microarray data clustering.
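To make the notion of an association matrix concrete, the following is a minimal sketch, not the LCE method itself. It assumes an ensemble is given as a list of label vectors (one per base clustering) and builds two common summaries: a binary sample-by-cluster association matrix and a pairwise co-association matrix. The function names and the toy ensemble are hypothetical, chosen only for illustration.

```python
from typing import List


def association_matrix(partitions: List[List[int]]) -> List[List[int]]:
    """Binary sample-by-cluster matrix with one column per cluster
    of every base partition (1 = sample belongs to that cluster)."""
    n = len(partitions[0])
    columns = []
    for labels in partitions:
        if len(labels) != n:
            raise ValueError("all partitions must label the same samples")
        for cluster in sorted(set(labels)):
            columns.append([1 if lab == cluster else 0 for lab in labels])
    # Transpose so that each row corresponds to one sample.
    return [list(row) for row in zip(*columns)]


def co_association(partitions: List[List[int]]) -> List[List[float]]:
    """Fraction of base partitions in which two samples share a cluster."""
    n = len(partitions[0])
    m = len(partitions)
    return [
        [sum(p[i] == p[j] for p in partitions) / m for j in range(n)]
        for i in range(n)
    ]


# Hypothetical ensemble: three base clusterings of five samples.
ensemble = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 2],
]

bm = association_matrix(ensemble)  # 5 samples x 7 clusters
ca = co_association(ensemble)      # 5 x 5 pairwise agreement

for row in bm:
    print(row)
```

Each entry of the binary matrix records only whether a sample was assigned to a cluster. A zero says nothing about how similar that cluster is to the one the sample does belong to, which is the kind of cluster-to-cluster relation the paragraph above describes as neglected.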
RESULTS
The link-based cluster ensemble (LCE) method presented here implements these ideas and demonstrates outstanding performance. Experimental results on real gene expression and synthetic datasets indicate that LCE: (i) usually outperforms existing cluster ensemble algorithms in individual tests and, overall, is clearly class-leading; (ii) delivers excellent, robust performance across different types of data, especially in the presence of noise and imbalanced data clusters; (iii) provides a high-level data matrix that is applicable to many numerical clustering techniques; and (iv) is computationally efficient for large datasets and gene clustering.
AVAILABILITY
Online supplementary material and an implementation are available at: http://users.aber.ac.uk/nii07/bioinformatics2010.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.