Zambelli Antoine E
Quantech Solutions LLC, San Rafael, CA, USA.
F1000Res. 2016 Dec 1;5. doi: 10.12688/f1000research.10103.1. eCollection 2016.
DNA microarray and gene expression problems often require a researcher to cluster their data in order to better understand its structure. When the number of clusters is not known, one can resort to hierarchical clustering methods. However, very few automated algorithms currently exist for determining the true number of clusters in the data. We propose two new methods (mode and maximum difference) for estimating the number of clusters in a hierarchical clustering framework, creating a fully automated process with no human intervention. These methods are compared to the established elbow and gap statistic algorithms using simulated datasets and the Biobase Gene ExpressionSet. We also explore a data mixing procedure inspired by cross-validation techniques. We find that the maximum difference method performs comparably to or better than the gap statistic in multi-cluster scenarios, and does so at a fraction of the computational cost. This method also responds well to our mixing procedure, which opens the door to future research. We conclude that both the mode and maximum difference methods warrant further study of their mixing and cross-validation potential. We particularly recommend the maximum difference method in multi-cluster scenarios, given its accuracy and execution times, and present it as an alternative to existing algorithms.
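The abstract does not define the maximum difference method, so the following is only a minimal sketch of one plausible reading: build the hierarchy, record the distance at which each merge happens, and cut at the largest jump between consecutive merge distances. The function names `merge_heights` and `estimate_k_by_largest_gap`, the choice of single linkage, and the Euclidean metric are all assumptions made for illustration and are not taken from the paper.

```python
import math
from itertools import combinations


def merge_heights(points):
    """Naive single-linkage agglomerative clustering (assumed linkage).

    Returns the merge distances in the order the merges happen.
    Single linkage guarantees these distances are non-decreasing.
    """
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest single-linkage distance.
        for i, j in combinations(range(len(clusters)), 2):
            d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        heights.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return heights


def estimate_k_by_largest_gap(points):
    """Hypothetical estimator: cut the hierarchy at the largest jump
    between consecutive merge distances and return the cluster count."""
    heights = merge_heights(points)
    if len(heights) < 2:
        raise ValueError("need at least three points")
    n = len(points)
    gaps = [heights[m + 1] - heights[m] for m in range(len(heights) - 1)]
    # Index of the largest gap; ties resolve to the earliest merge.
    m = max(range(len(gaps)), key=gaps.__getitem__)
    # After merge m (0-based) there are n - 1 - m clusters left.
    return n - 1 - m


# Three well-separated groups of three points each (made-up data).
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]
print(estimate_k_by_largest_gap(points))
```

Under this reading the estimate comes from a single pass over the merge distances, with no resampling or reference datasets. The sketch can never return a single cluster, since a gap always sits between two merges; the naive clustering loop is written for clarity and is far too slow for real expression data.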