Statistical significance of clustering for count data.

Authors

Dai Yifan, Wu Di, Liu Yufeng

Affiliations

Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.

Department of Biomedical Sciences, Adams School of Dentistry, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.

Publication information

Biometrics. 2025 Jul 3;81(3). doi: 10.1093/biomtc/ujaf120.

DOI: 10.1093/biomtc/ujaf120

PMID: 40971569

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12448855/

Abstract

Clustering is widely used in biomedical research for meaningful subgroup identification. However, most existing clustering algorithms do not account for the statistical uncertainty of the resulting clusters and consequently may generate spurious clusters due to natural sampling variation. To address this problem, the Statistical Significance of Clustering (SigClust) method was developed to evaluate the significance of clusters in high-dimensional data. While SigClust has been successful in assessing clustering significance for continuous data, it is not specifically designed for discrete data, such as count data in genomics. Moreover, SigClust and its variations can suffer from reduced statistical power when applied to non-Gaussian high-dimensional data. To overcome these limitations, we propose SigClust-DEV, a method designed to evaluate the significance of clusters in count data. Through extensive simulations, we compare SigClust-DEV against other existing SigClust approaches across various count distributions and demonstrate its superior performance. Furthermore, we apply our proposed SigClust-DEV to Hydra single-cell RNA sequencing (scRNA) data and electronic health records (EHRs) of cancer patients to identify meaningful latent cell types and patient subgroups, respectively.
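
The abstract does not describe how SigClust-DEV works, so the following is only a minimal sketch of the general Monte Carlo idea behind the original SigClust for continuous data, as it is commonly described: compute the 2-means cluster index (within-cluster sum of squares divided by total sum of squares) and compare it with values simulated under a single-Gaussian null. The function names `two_means_cluster_index` and `gaussian_null_pvalue` are hypothetical, and the use of raw sample covariance eigenvalues for the null is a simplifying assumption. It is not the authors' implementation and does not handle the count-data setting that SigClust-DEV targets.

```python
import numpy as np


def two_means_cluster_index(X, n_init=5, n_iter=50, rng=None):
    """Return the 2-means cluster index: within-cluster SS / total SS.

    Uses a plain Lloyd's algorithm with random restarts (illustrative only).
    """
    rng = np.random.default_rng(rng)
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    best = total_ss  # the index can never usefully exceed 1
    for _ in range(n_init):
        # Start from two distinct rows of the data.
        centers = X[rng.choice(len(X), size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if labels.min() == labels.max():  # one cluster is empty
                break
            new_centers = np.array(
                [X[labels == k].mean(axis=0) for k in (0, 1)]
            )
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, dist.min(axis=1).sum())
    return best / total_ss


def gaussian_null_pvalue(X, n_sim=200, seed=0):
    """Monte Carlo p-value for the cluster index under a Gaussian null.

    Assumption: the null covariance is diagonal with the raw sample
    eigenvalues; the cluster index is rotation invariant, so simulating
    in the eigenbasis is sufficient. Requires at least two features.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    observed = two_means_cluster_index(X, rng=rng)
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negatives
    null = np.empty(n_sim)
    for b in range(n_sim):
        Z = rng.standard_normal((n, d)) * np.sqrt(eigvals)
        null[b] = two_means_cluster_index(Z, rng=rng)
    # A small cluster index means strong clustering, so count the
    # simulated values that are at least as extreme (as small).
    return (1 + (null <= observed).sum()) / (1 + n_sim)
```

A small p-value suggests that the observed two-cluster split is tighter than what a single Gaussian population would typically produce. For count data, a Gaussian null of this kind may be a poor fit, which is the limitation the abstract says SigClust-DEV is designed to address.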
