Are clusters found in one dataset present in another dataset?

Authors

Kapp Amy V, Tibshirani Robert

Affiliation

Department of Statistics, Stanford University, Stanford, CA 94305-4065, USA.

Publication information

Biostatistics. 2007 Jan;8(1):9-31. doi: 10.1093/biostatistics/kxj029. Epub 2006 Apr 12.

DOI:10.1093/biostatistics/kxj029
PMID:16613834
Abstract

In many microarray studies, a cluster defined on one dataset is sought in an independent dataset. If the cluster is found in the new dataset, the cluster is said to be "reproducible" and may be biologically significant. Classifying a new datum to a previously defined cluster can be seen as predicting which of the previously defined clusters is most similar to the new datum. If the new data classified to a cluster are similar, molecularly or clinically, to the data already present in the cluster, then the cluster is reproducible and the corresponding prediction accuracy is high. Here, we take advantage of the connection between reproducibility and prediction accuracy to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized. We define a cluster quality measure called the "in-group proportion" (IGP) and introduce a general procedure for individually validating clusters. Using simulations and real breast cancer datasets, the IGP is compared to four other popular cluster quality measures (homogeneity score, separation score, silhouette width, and weighted average discrepant pairs score). Moreover, simulations and the real breast cancer datasets are used to compare the four versions of the validation procedure, which all use the IGP but differ in the way in which the null distributions are generated. We find that the IGP is the best measure of prediction accuracy, and one version of the validation procedure is more widely applicable than the other three. An implementation of this algorithm is in a package called "clusterRepro" available through The Comprehensive R Archive Network (http://cran.r-project.org).
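The abstract names the IGP but does not give its formula. As a rough illustration only, the sketch below assumes the IGP of a cluster is the proportion of samples assigned to that cluster whose nearest neighbour is assigned to the same cluster, and it assumes Pearson correlation both for assigning new samples to previously defined centroids and for finding nearest neighbours. The function names (`pearson`, `classify_to_centroids`, `in_group_proportion`) and the toy data are hypothetical and are not the clusterRepro API.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # treat constant profiles as uncorrelated
    return cov / sqrt(var_a * var_b)


def classify_to_centroids(samples, centroids):
    """Assign each new sample to the index of its most correlated centroid.

    Assumption: similarity is Pearson correlation.
    """
    return [
        max(range(len(centroids)), key=lambda k: pearson(s, centroids[k]))
        for s in samples
    ]


def in_group_proportion(samples, labels):
    """Per-cluster fraction of samples whose nearest neighbour shares their label.

    Assumption: the nearest neighbour is the other sample with the highest
    Pearson correlation. Requires at least two samples.
    """
    totals, hits = {}, {}
    for i, s in enumerate(samples):
        others = (j for j in range(len(samples)) if j != i)
        nearest = max(others, key=lambda j: pearson(s, samples[j]))
        totals[labels[i]] = totals.get(labels[i], 0) + 1
        if labels[nearest] == labels[i]:
            hits[labels[i]] = hits.get(labels[i], 0) + 1
    return {u: hits.get(u, 0) / totals[u] for u in totals}


# Toy data: two centroids "defined" on a first dataset, four new samples.
centroids = [[1, 2, 3, 4], [4, 3, 2, 1]]
new_samples = [[1, 2, 3, 5], [0, 2, 3, 4], [5, 3, 2, 0], [4, 3, 1, 1]]

labels = classify_to_centroids(new_samples, centroids)
igp = in_group_proportion(new_samples, labels)
print(labels, igp)
```

An IGP near 1 for a cluster suggests that the new samples assigned to it sit close to one another, which is the sense of "reproducible" used in the abstract. Judging significance would additionally require a null distribution, which this sketch does not attempt.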

