Suppr超能文献

具有 H 的可扩展且无偏的不和谐度量。

A scalable and unbiased discordance metric with H.

机构信息

Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N Wolfe St, Baltimore, MD 21205, USA.

Department of Computer Science, Johns Hopkins University, 3400 N Charles St, Baltimore, MD 21218, USA.

出版信息

Biostatistics. 2023 Dec 15;25(1):188-202. doi: 10.1093/biostatistics/kxac035.

Abstract

A standard unsupervised analysis is to cluster observations into discrete groups using a dissimilarity measure, such as Euclidean distance. If there does not exist a ground-truth label for each observation necessary for external validity metrics, then internal validity metrics, such as the tightness or separation of the clusters, are often used. However, the interpretation of these internal metrics can be problematic when using different dissimilarity measures as they have different magnitudes and ranges of values that they span. To address this problem, previous work introduced the "scale-agnostic" $G_{+}$ discordance metric; however, this internal metric is slow to calculate for large data. Furthermore, in the setting of unsupervised clustering with $k$ groups, we show that $G_{+}$ varies as a function of the proportion of observations assigned to each of the groups (or clusters), referred to as the group balance, which is an undesirable property. To address this problem, we propose a modification of $G_{+}$, referred to as $H_{+}$, and demonstrate that $H_{+}$ does not vary as a function of group balance using a simulation study and with public single-cell RNA-sequencing data. Finally, we provide scalable approaches to estimate $H_{+}$, which are available in the $\mathtt{fasthplus}$ R package.

摘要

一种标准的无监督分析方法是使用一种不相似性度量(如欧几里得距离)将观测值聚类到离散的组中。如果对于每个观测值都不存在用于外部有效性度量的ground-truth 标签,那么通常使用内部有效性度量,如聚类的紧密度或分离度。然而,当使用不同的不相似性度量时,这些内部度量的解释可能会出现问题,因为它们具有不同的量级和取值范围。为了解决这个问题,之前的工作引入了“与尺度无关的”$G_{+}$不和谐度量;然而,这个内部度量对于大数据集来说计算速度较慢。此外,在具有$k$个组的无监督聚类设置中,我们表明$G_{+}$随观测值分配到每个组(或聚类)的比例(即组平衡)而变化,这是一个不理想的特性。为了解决这个问题,我们提出了对$G_{+}$的修改,称为$H_{+}$,并通过模拟研究和公共单细胞 RNA 测序数据证明了$H_{+}$不随组平衡而变化。最后,我们提供了可扩展的方法来估计$H_{+}$,这些方法可在$\mathtt{fasthplus}$R 包中使用。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2296/10724244/cc5b11bbf04c/kxac035f1.jpg

相似文献

1
A scalable and unbiased discordance metric with H.具有 H 的可扩展且无偏的不和谐度量。
Biostatistics. 2023 Dec 15;25(1):188-202. doi: 10.1093/biostatistics/kxac035.
4
clusterBMA: Bayesian model averaging for clustering.聚类 BMA:用于聚类的贝叶斯模型平均。
PLoS One. 2023 Aug 21;18(8):e0288000. doi: 10.1371/journal.pone.0288000. eCollection 2023.
9
Simulation-derived best practices for clustering clinical data.基于模拟的临床数据聚类最佳实践。
J Biomed Inform. 2021 Jun;118:103788. doi: 10.1016/j.jbi.2021.103788. Epub 2021 Apr 20.

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验