Department of Statistics, The Pennsylvania State University, University Park, PA 16802, USA.
Bioinformatics. 2020 Jun 1;36(11):3516-3521. doi: 10.1093/bioinformatics/btaa165.
Cluster analysis is widely used to identify interesting subgroups in biomedical data. Since true class labels are unknown in the unsupervised setting, it is challenging to validate any cluster obtained computationally, an important problem barely addressed by the research community.
We have developed a toolkit called covering point set (CPS) analysis to quantify uncertainty at the levels of individual clusters and overall partitions. Functions have been developed to effectively visualize the inherent variation in any cluster for high-dimensional data and to provide a more comprehensive view of potentially interesting subgroups in the data. Applying it to three usage scenarios for biomedical data, we demonstrate that CPS analysis is more effective for evaluating the uncertainty of clusters than state-of-the-art measures. We also showcase how to use CPS analysis to select data generation technologies or visualization methods.
The method is implemented in an R package called OTclust, available on CRAN.
lzz46@psu.edu or jiali@psu.edu.
Supplementary data are available at Bioinformatics online.