Department of Computer Science and Engineering, The Ohio State University, 3165 Graves Hall, 333 West 10th Avenue, Columbus, OH 43210, USA.
Brief Bioinform. 2013 May;14(3):279-92. doi: 10.1093/bib/bbs032. Epub 2012 Jul 6.
The need to analyze high-dimensional biological data is driving the development of new data mining methods. Biclustering algorithms have been successfully applied to gene expression data to discover local patterns, in which a subset of genes exhibits similar expression levels over a subset of conditions. However, it is not clear which algorithms are best suited for this task. Many algorithms have been published in the past decade, most of which have been compared only to a small number of other algorithms. Surveys and comparisons exist in the literature, but because of the large number and variety of biclustering algorithms, they quickly become outdated. In this article, we partially address the problem of evaluating the strengths and weaknesses of existing biclustering methods. We used the BiBench package to compare 12 algorithms, many of which were recently published or have not been extensively studied. The algorithms were tested on a suite of synthetic data sets to measure their performance on data with varying conditions, such as different bicluster models, varying noise, varying numbers of biclusters and overlapping biclusters. The algorithms were also tested on eight large gene expression data sets obtained from the Gene Expression Omnibus. Gene Ontology enrichment analysis was performed on the resulting biclusters, and the best enrichment terms are reported. Our analyses show that the biclustering method and its parameters should be selected based on the desired model, whether that model allows overlapping biclusters, and its robustness to noise. In addition, we observe that the biclustering algorithms capable of finding more than one model are more successful at capturing biologically relevant clusters.
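To make the synthetic-data setup concrete, the following is a minimal sketch of how a single constant bicluster might be planted in a noisy matrix and how a recovered bicluster could be scored against it. It is not the BiBench API and not necessarily the scoring used in the article: the function names (`plant_bicluster`, `cells`, `jaccard`), the constant-value model, the Gaussian noise parameters, and the cell-level Jaccard index are all assumptions chosen for illustration.

```python
import random


def plant_bicluster(n_rows, n_cols, bic_rows, bic_cols,
                    value=5.0, noise_sd=0.5, seed=0):
    """Return (data, rows, cols) for a matrix with one planted bicluster.

    Hypothetical helper: the background is standard normal, and the
    planted cells hold a constant value plus Gaussian noise.
    """
    rng = random.Random(seed)
    # Background entries drawn from N(0, 1).
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)]
            for _ in range(n_rows)]
    # Choose which genes (rows) and conditions (columns) form the bicluster.
    rows = set(rng.sample(range(n_rows), bic_rows))
    cols = set(rng.sample(range(n_cols), bic_cols))
    # Overwrite the selected cells with the noisy constant pattern.
    for i in rows:
        for j in cols:
            data[i][j] = value + rng.gauss(0.0, noise_sd)
    return data, rows, cols


def cells(rows, cols):
    """Expand a (rows, cols) bicluster into its set of matrix cells."""
    return {(i, j) for i in rows for j in cols}


def jaccard(found, expected):
    """Cell-level Jaccard index between two (rows, cols) biclusters."""
    a, b = cells(*found), cells(*expected)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    data, rows, cols = plant_bicluster(100, 50, bic_rows=10, bic_cols=5)
    # A perfect recovery scores 1.0; a disjoint guess scores 0.0.
    print(jaccard((rows, cols), (rows, cols)))
    print(jaccard(({0, 1}, {0, 1}), ({2, 3}, {2, 3})))
```

Raising `noise_sd`, planting several biclusters, or letting their row and column sets intersect would correspond to the noise, bicluster-count and overlap conditions described in the abstract; other bicluster models would require a different planting rule.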