
A scalable and unbiased discordance metric with $H_{+}$.

Affiliations

Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N Wolfe St, Baltimore, MD 21205, USA.

Department of Computer Science, Johns Hopkins University, 3400 N Charles St, Baltimore, MD 21218, USA.

Publication information

Biostatistics. 2023 Dec 15;25(1):188-202. doi: 10.1093/biostatistics/kxac035.

DOI:10.1093/biostatistics/kxac035
PMID:36063544
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10724244/
Abstract

A standard unsupervised analysis is to cluster observations into discrete groups using a dissimilarity measure, such as Euclidean distance. If there does not exist a ground-truth label for each observation necessary for external validity metrics, then internal validity metrics, such as the tightness or separation of the clusters, are often used. However, the interpretation of these internal metrics can be problematic when using different dissimilarity measures as they have different magnitudes and ranges of values that they span. To address this problem, previous work introduced the "scale-agnostic" $G_{+}$ discordance metric; however, this internal metric is slow to calculate for large data. Furthermore, in the setting of unsupervised clustering with $k$ groups, we show that $G_{+}$ varies as a function of the proportion of observations assigned to each of the groups (or clusters), referred to as the group balance, which is an undesirable property. To address this problem, we propose a modification of $G_{+}$, referred to as $H_{+}$, and demonstrate that $H_{+}$ does not vary as a function of group balance using a simulation study and with public single-cell RNA-sequencing data. Finally, we provide scalable approaches to estimate $H_{+}$, which are available in the $\mathtt{fasthplus}$ R package.
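The abstract does not state the formulas, so the following is only a minimal sketch under stated assumptions. It assumes that discordance counts the pairs (one within-cluster dissimilarity, one between-cluster dissimilarity) in which the within-cluster value is strictly larger, that $G_{+}$ divides this count by the number of all pairs of dissimilarities, $N_d(N_d-1)/2$, and that $H_{+}$ divides it by $|D_W||D_B|$, the number of within/between pairs. All function names are hypothetical. The sampled estimator is a generic Monte Carlo approximation and is not claimed to be the algorithm used in $\mathtt{fasthplus}$.

```python
import bisect
import itertools
import math
import random


def pair_dissimilarities(points, labels):
    """Split all pairwise Euclidean distances into within- and between-cluster lists."""
    within, between = [], []
    for i, j in itertools.combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)
    return within, between


def count_discordant(within, between):
    """Count (within, between) pairs whose within distance is strictly larger."""
    between_sorted = sorted(between)
    # bisect_left returns how many between distances are strictly less than w.
    return sum(bisect.bisect_left(between_sorted, w) for w in within)


def g_plus(within, between):
    """Assumed G+: discordant count over all pairs of dissimilarities."""
    n_d = len(within) + len(between)
    return count_discordant(within, between) / (n_d * (n_d - 1) / 2)


def h_plus(within, between):
    """Assumed H+: discordant count over within/between pairs only."""
    return count_discordant(within, between) / (len(within) * len(between))


def h_plus_sampled(within, between, n_draws=10_000, seed=0):
    """Monte Carlo estimate of the assumed H+ from random (within, between) draws."""
    rng = random.Random(seed)
    hits = sum(rng.choice(within) > rng.choice(between) for _ in range(n_draws))
    return hits / n_draws


# Four points on a line: two tight, well-separated groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]

good = pair_dissimilarities(points, [0, 0, 1, 1])  # labels match the structure
bad = pair_dissimilarities(points, [0, 1, 0, 1])   # labels cut across it

print(g_plus(*good), h_plus(*good))  # 0.0 0.0
print(g_plus(*bad), h_plus(*bad))    # 0.4 0.75
```

Under these assumed definitions, the two metrics differ only in the denominator. The $G_{+}$ denominator depends on the total number of dissimilarities, while the $H_{+}$ denominator depends on how many of them are within-cluster versus between-cluster, which is the quantity that changes with group balance.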


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2296/10724244/cc5b11bbf04c/kxac035f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2296/10724244/81ab467185f5/kxac035f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2296/10724244/fb1fb7ab4d2d/kxac035f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2296/10724244/1377ef6638bc/kxac035f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2296/10724244/908599d661a9/kxac035f5.jpg


