Department of Mathematics, KU Leuven, Leuven 3001, Belgium.
Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
Bioinformatics. 2020 Jun 1;36(12):3849-3855. doi: 10.1093/bioinformatics/btaa243.
Many popular clustering methods are not scale-invariant because they are based on Euclidean distances. Even methods using scale-invariant distances, such as the Mahalanobis distance, lose their scale invariance when combined with regularization and/or variable selection. Therefore, the results from these methods are very sensitive to the measurement units of the clustering variables. A simple way to achieve scale invariance is to scale the variables before clustering. However, scaling variables is a very delicate issue in cluster analysis: A bad choice of scaling can adversely affect the clustering results. On the other hand, reporting clustering results that depend on measurement units is not satisfactory. Hence, a safe and efficient scaling procedure is needed for applications in bioinformatics and medical sciences research.
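The unit sensitivity described above can be seen in a small sketch. The three points and their height and weight values are hypothetical, chosen only for illustration; the code shows that the Euclidean nearest neighbour of a point can change when one variable is merely re-expressed in different units.

```python
import math

def nearest(points, query_name):
    """Return the name of the point closest (Euclidean) to the query point."""
    q = points[query_name]
    distances = {
        name: math.dist(q, p)
        for name, p in points.items()
        if name != query_name
    }
    return min(distances, key=distances.get)

# Hypothetical data: (height in metres, weight in kilograms).
metres = {"A": (1.60, 60.0), "B": (1.90, 61.0), "C": (1.62, 65.0)}

# Same individuals, with height converted to centimetres.
centimetres = {name: (h * 100.0, w) for name, (h, w) in metres.items()}

nearest_in_metres = nearest(metres, "A")
nearest_in_centimetres = nearest(centimetres, "A")
print(nearest_in_metres, nearest_in_centimetres)
```

With height in metres the weight difference dominates the distance, while with height in centimetres the height difference dominates, so the neighbour of A switches from B to C although the underlying data are unchanged.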
We propose a new approach for scaling prior to cluster analysis based on the concept of pooled variance. Unlike available scaling procedures, such as scaling by the standard deviation (SD) or the range, our proposed scale avoids dampening the beneficial effect of informative clustering variables. We confirm through an extensive simulation study and applications to well-known real-data examples that the proposed scaling method is safe and generally useful. Finally, we use our approach to cluster a high-dimensional genomic dataset consisting of gene expression data for several specimens of breast cancer tissue obtained from human patients.
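To illustrate why a pooled scale can preserve cluster separation, the following is a minimal sketch and not the authors' procedure. It assumes the group labels of a single variable are already known, whereas in practice they would have to be estimated (the abstract does not say how). The function `pooled_sd` and the toy data are hypothetical.

```python
import math
from statistics import mean, stdev

def pooled_sd(values, labels):
    """Pooled standard deviation: within-group sum of squares / (n - k)."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    within_ss = 0.0
    for vals in groups.values():
        centre = mean(vals)
        within_ss += sum((v - centre) ** 2 for v in vals)
    n, k = len(values), len(groups)
    return math.sqrt(within_ss / (n - k))

# Hypothetical informative variable: two well-separated groups.
values = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]  # assumed known for this illustration only

overall = stdev(values)             # overall SD, inflated by the group gap
pooled = pooled_sd(values, labels)  # reflects only within-group spread

# Distance between group means after dividing by each scale.
raw_gap = mean(values[3:]) - mean(values[:3])
gap_overall = raw_gap / overall
gap_pooled = raw_gap / pooled
print(overall, pooled, gap_overall, gap_pooled)
```

In this toy example the overall SD is about 5.55 while the pooled SD is 1.0, so dividing by the overall SD shrinks the gap between group means to about 1.8 scale units, whereas dividing by the pooled SD keeps it at 10. This is one way to read the claim that SD scaling can dampen an informative variable.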
An R implementation of the algorithms presented is available at https://wis.kuleuven.be/statdatascience/robust/software.
Supplementary data are available at Bioinformatics online.