CPS analysis: self-contained validation of biomedical data clustering.

Affiliation

Department of Statistics, The Pennsylvania State University, University Park, PA 16802, USA.

Publication information

Bioinformatics. 2020 Jun 1;36(11):3516-3521. doi: 10.1093/bioinformatics/btaa165.

DOI:10.1093/bioinformatics/btaa165

PMID:32154841

Abstract

MOTIVATION

Cluster analysis is widely used to identify interesting subgroups in biomedical data. Since true class labels are unknown in the unsupervised setting, it is challenging to validate any cluster obtained computationally, an important problem barely addressed by the research community.

RESULTS

We have developed a toolkit called covering point set (CPS) analysis to quantify uncertainty at the levels of individual clusters and overall partitions. Functions have been developed to effectively visualize the inherent variation in any cluster for high-dimensional data and to provide a more comprehensive view of potentially interesting subgroups in the data. Applying CPS analysis to three usage scenarios for biomedical data, we demonstrate that it is more effective for evaluating the uncertainty of clusters than state-of-the-art measurements. We also showcase how to use CPS analysis to select data generation technologies or visualization methods.
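To make the idea of cluster-level uncertainty concrete, the following is a minimal sketch of a generic perturbation-based stability score. It is not the CPS algorithm and not the OTclust API: the function names (`kmeans_1d`, `jaccard`, `cluster_stability`), the Gaussian noise model, and the use of one-dimensional k-means are all assumptions made for illustration. Each reference cluster is scored by how well it is recovered, on average, when the data are perturbed and re-clustered.

```python
import random


def kmeans_1d(xs, init_centers, iters=50):
    """Plain Lloyd's algorithm on 1-D data; returns one label per point."""
    centers = list(init_centers)

    def assign():
        # Label each point with the index of its nearest center.
        return [min(range(len(centers)), key=lambda j: abs(x - centers[j]))
                for x in xs]

    for _ in range(iters):
        labels = assign()
        for j in range(len(centers)):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = sum(members) / len(members)
    return assign()


def jaccard(a, b):
    """Jaccard similarity of two sets of point indices."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_stability(xs, init_centers, n_runs=20, noise=0.1, seed=0):
    """Illustrative per-cluster stability (hypothetical helper, not CPS).

    Clusters the original data once, then re-clusters n_runs noisy copies.
    Each reference cluster gets the mean, over runs, of its best Jaccard
    match among the clusters found on the perturbed data.
    """
    rng = random.Random(seed)
    k = len(init_centers)
    ref = kmeans_1d(xs, init_centers)
    ref_sets = [{i for i, lab in enumerate(ref) if lab == j} for j in range(k)]
    totals = [0.0] * k
    for _ in range(n_runs):
        noisy = [x + rng.gauss(0.0, noise) for x in xs]
        labels = kmeans_1d(noisy, init_centers)
        sets = [{i for i, lab in enumerate(labels) if lab == j}
                for j in range(k)]
        for j in range(k):
            totals[j] += max(jaccard(ref_sets[j], s) for s in sets)
    return [t / n_runs for t in totals]


if __name__ == "__main__":
    separated = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
    print(cluster_stability(separated, [0.0, 5.0]))
```

In this sketch, a score near 1 suggests a cluster that survives perturbation, while a lower score suggests its membership depends on noise. The toolkit described above also reports uncertainty for the overall partition, which this illustration does not attempt.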

AVAILABILITY AND IMPLEMENTATION

The method is implemented in an R package called OTclust, available on CRAN.

CONTACT

lzz46@psu.edu or jiali@psu.edu.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

CPS analysis: self-contained validation of biomedical data clustering.

Bioinformatics. 2020 Jun 1;36(11):3516-3521. doi: 10.1093/bioinformatics/btaa165.

DIMM-SC: a Dirichlet mixture model for clustering droplet-based single cell transcriptomic data.

Bioinformatics. 2018 Jan 1;34(1):139-146. doi: 10.1093/bioinformatics/btx490.

i2d: an R package for simulating data from images and the implications in biomedical research.

Bioinformatics. 2021 Aug 25;37(16):2497-2498. doi: 10.1093/bioinformatics/btaa991.

A Bayesian two-way latent structure model for genomic data integration reveals few pan-genomic cluster subtypes in a breast cancer cohort.

Bioinformatics. 2019 Dec 1;35(23):4886-4897. doi: 10.1093/bioinformatics/btz381.

projectR: an R/Bioconductor package for transfer learning via PCA, NMF, correlation and clustering.

Bioinformatics. 2020 Jun 1;36(11):3592-3593. doi: 10.1093/bioinformatics/btaa183.

A latent unknown clustering integrating multi-omics data (LUCID) with phenotypic traits.

Bioinformatics. 2020 Feb 1;36(3):842-850. doi: 10.1093/bioinformatics/btz667.

Statistical significance of cluster membership for unsupervised evaluation of cell identities.

Bioinformatics. 2020 May 1;36(10):3107-3114. doi: 10.1093/bioinformatics/btaa087.

Defining an informativeness metric for clustering gene expression data.

Bioinformatics. 2011 Apr 15;27(8):1094-100. doi: 10.1093/bioinformatics/btr074. Epub 2011 Feb 16.

VarSelLCM: an R/C++ package for variable selection in model-based clustering of mixed-data with missing values.

Bioinformatics. 2019 Apr 1;35(7):1255-1257. doi: 10.1093/bioinformatics/bty786.

R.JIVE for exploration of multi-source molecular data.

Bioinformatics. 2016 Sep 15;32(18):2877-9. doi: 10.1093/bioinformatics/btw324. Epub 2016 Jun 6.

Cited by

GeM-LR: Discovering predictive biomarkers for small datasets in vaccine studies.

PLoS Comput Biol. 2024 Nov 14;20(11):e1012581. doi: 10.1371/journal.pcbi.1012581. eCollection 2024 Nov.

Statistical and machine learning methods for immunoprofiling based on single-cell data.

Hum Vaccin Immunother. 2023 Aug 1;19(2):2234792. doi: 10.1080/21645515.2023.2234792. Epub 2023 Jul 24.

Multi-view clustering by CPS-merge analysis with application to multimodal single-cell data.

PLoS Comput Biol. 2023 Apr 17;19(4):e1011044. doi: 10.1371/journal.pcbi.1011044. eCollection 2023 Apr.

Stability estimation for unsupervised clustering: A review.

Wiley Interdiscip Rev Comput Stat. 2022 Nov-Dec;14(6):e1575. doi: 10.1002/wics.1575. Epub 2022 Jan 9.
