The Jackson Laboratory, Bar Harbor, ME 04609, USA.
BMC Bioinformatics. 2012 Jun 25;13 Suppl 10(Suppl 10):S7. doi: 10.1186/1471-2105-13-S10-S7.
A wealth of clustering algorithms has been applied to gene co-expression experiments. These algorithms cover a broad range of approaches, from conventional techniques such as k-means and hierarchical clustering to graph-based approaches such as k-clique communities, weighted gene co-expression network analysis (WGCNA) and paraclique. Comparing these methods to evaluate their relative effectiveness provides guidance for algorithm selection, development and implementation. Most prior work on comparative clustering evaluation has focused on parametric methods. Graph-theoretical methods are more recent additions to the tool set for the global analysis and decomposition of microarray co-expression matrices, and they have generally not been included in earlier methodological comparisons. In the present study, a variety of parametric and graph-theoretical clustering algorithms are compared using well-characterized, genome-scale transcriptomic data from Saccharomyces cerevisiae.
For each clustering method under study, a variety of parameters were tested. Jaccard similarity was used to measure each cluster's agreement with every GO and KEGG annotation set, and the highest Jaccard score was assigned to the cluster. Clusters were grouped into small, medium, and large bins, and the Jaccard scores of the five top-scoring clusters in each bin were averaged and reported as the best average top 5 (BAT5) score for the particular method.
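The scoring procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the size-bin thresholds (`small_max`, `medium_max`) and the toy gene identifiers are assumptions made for illustration, since the abstract does not state how the bins were defined or how scores were combined across parameter settings.

```python
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two gene sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_annotation_score(cluster, annotation_sets):
    """Highest Jaccard score of a cluster against any annotation set."""
    return max(
        (jaccard(cluster, genes) for genes in annotation_sets.values()),
        default=0.0,
    )


def size_bin(n_genes, small_max=10, medium_max=100):
    """Assign a cluster to a size bin.

    The thresholds are hypothetical; the abstract does not give them.
    """
    if n_genes <= small_max:
        return "small"
    if n_genes <= medium_max:
        return "medium"
    return "large"


def top5_average_per_bin(clusters, annotation_sets, top_n=5):
    """Average the top_n best-annotation scores within each size bin.

    Returns None for a bin that contains no clusters.
    """
    scores = {"small": [], "medium": [], "large": []}
    for cluster in clusters:
        score = best_annotation_score(cluster, annotation_sets)
        scores[size_bin(len(cluster))].append(score)
    return {
        name: mean(sorted(values, reverse=True)[:top_n]) if values else None
        for name, values in scores.items()
    }


# Toy example with made-up gene identifiers and annotation sets.
annotations = {
    "GO:example": {"g1", "g2", "g3"},
    "KEGG:example": {"g4", "g5"},
}
clusters = [{"g1", "g2"}, {"g4", "g5"}]
result = top5_average_per_bin(clusters, annotations)
```

Under these assumptions, one would run `top5_average_per_bin` once per parameter setting of a clustering method and keep the best-scoring setting as that method's reported value.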
Clusters produced by each method were evaluated by how well they matched known pathways. This produces a readily interpretable ranking of the relative effectiveness of the methods in clustering the genes. Methods were also tested to determine whether they were able to identify clusters consistent with those identified by other clustering methods.
Validation of clusters against known gene classifications demonstrates that, for these data, graph-based techniques outperform conventional clustering approaches, suggesting that further development and application of combinatorial strategies is warranted.