Clustering cancer gene expression data: a comparative study.

Author information

de Souto Marcilio C P, Costa Ivan G, de Araujo Daniel S A, Ludermir Teresa B, Schliep Alexander

Affiliation

Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.

Publication information

BMC Bioinformatics. 2008 Nov 27;9:497. doi: 10.1186/1471-2105-9-497.

DOI: 10.1186/1471-2105-9-497

PMID: 19038021

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2632677/

Abstract

BACKGROUND

The use of clustering methods for the discovery of cancer subtypes has drawn a great deal of attention in the scientific community. While bioinformaticians have proposed new clustering methods that take advantage of characteristics of the gene expression data, the medical community has a preference for using "classic" clustering methods. There have been no studies thus far performing a large-scale evaluation of different clustering methods in this context.

RESULTS/CONCLUSION

We present the first large-scale analysis of seven different clustering methods and four proximity measures for the analysis of 35 cancer gene expression data sets. Our results reveal that the finite mixture of Gaussians, followed closely by k-means, exhibited the best performance in terms of recovering the true structure of the data sets. These methods also exhibited, on average, the smallest difference between the actual number of classes in the data sets and the best number of clusters as indicated by our validation criteria. Furthermore, hierarchical methods, which have been widely used by the medical community, exhibited a poorer recovery performance than that of the other methods evaluated. Moreover, as a stable basis for the assessment and comparison of different clustering methods for cancer gene expression data, this study provides a common group of data sets (benchmark data sets) to be shared among researchers and used for comparisons with new methods. The data sets analyzed in this study are available at http://algorithmics.molgen.mpg.de/Supplements/CompCancer/.
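As a rough illustration of what "recovering the true structure" of a data set can mean in practice, the sketch below scores a clustering against known class labels with the adjusted Rand index. The abstract does not name the agreement measure, so the choice of index, the function name `adjusted_rand_index`, and the toy labels are assumptions made for illustration, not the study's code or data.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, cluster_labels):
    """Chance-corrected agreement between two labelings of the same samples."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)

    # Contingency counts: samples per (class, cluster), per class, per cluster.
    pair_counts = Counter(zip(true_labels, cluster_labels))
    class_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_labels)

    # Number of sample pairs grouped together under each view.
    together_both = sum(comb(c, 2) for c in pair_counts.values())
    together_true = sum(comb(c, 2) for c in class_counts.values())
    together_pred = sum(comb(c, 2) for c in cluster_counts.values())
    total_pairs = comb(n, 2)

    if total_pairs == 0:
        return 1.0  # fewer than two samples: nothing to disagree on
    expected = together_true * together_pred / total_pairs
    max_index = (together_true + together_pred) / 2
    if max_index == expected:
        return 1.0  # both labelings are trivial (one group or all singletons)
    return (together_both - expected) / (max_index - expected)


if __name__ == "__main__":
    # Hypothetical toy labels: two tumour classes, six samples.
    known_classes = ["A", "A", "A", "B", "B", "B"]
    partition_one = [0, 0, 0, 1, 1, 1]  # matches the classes up to renaming
    partition_two = [0, 1, 0, 1, 0, 1]  # mixes the classes
    print(adjusted_rand_index(known_classes, partition_one))
    print(adjusted_rand_index(known_classes, partition_two))
```

A score of 1 means the partition reproduces the known classes exactly (cluster names do not matter), while scores near 0 or below indicate agreement no better than chance. Comparing such scores across methods and data sets is one way a comparison like the one summarized above could be organized.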

Similar articles

GenClust: a genetic algorithm for clustering gene expression data.

BMC Bioinformatics. 2005 Dec 7;6:289. doi: 10.1186/1471-2105-6-289.

Robust multi-scale clustering of large DNA microarray datasets with the consensus algorithm.

Bioinformatics. 2006 Jan 1;22(1):58-67. doi: 10.1093/bioinformatics/bti746. Epub 2005 Oct 27.

Modeling and visualizing uncertainty in gene expression clusters using dirichlet process mixtures.

IEEE/ACM Trans Comput Biol Bioinform. 2009 Oct-Dec;6(4):615-28. doi: 10.1109/TCBB.2007.70269.

Comparing the performance of biomedical clustering methods.

Nat Methods. 2015 Nov;12(11):1033-8. doi: 10.1038/nmeth.3583. Epub 2015 Sep 21.

Comparisons and validation of statistical clustering techniques for microarray gene expression data.

Bioinformatics. 2003 Mar 1;19(4):459-66. doi: 10.1093/bioinformatics/btg025.

Simultaneous gene clustering and subset selection for sample classification via MDL.

Bioinformatics. 2003 Jun 12;19(9):1100-9. doi: 10.1093/bioinformatics/btg039.

Clustering of gene expression data: performance and similarity analysis.

BMC Bioinformatics. 2006 Dec 12;7 Suppl 4(Suppl 4):S19. doi: 10.1186/1471-2105-7-S4-S19.

Many accurate small-discriminatory feature subsets exist in microarray transcript data: biomarker discovery.

BMC Bioinformatics. 2005 Apr 13;6:97. doi: 10.1186/1471-2105-6-97.

Evaluation of clustering algorithms for gene expression data.

BMC Bioinformatics. 2006 Dec 12;7 Suppl 4(Suppl 4):S17. doi: 10.1186/1471-2105-7-S4-S17.

Cited by

Clustering of electronic health records in atrial fibrillation patients and impact on prognosis and patient trajectories: a UK linked-dataset study.

Eur Heart J Digit Health. 2025 Apr 5;6(4):797-810. doi: 10.1093/ehjdh/ztaf032. eCollection 2025 Jul.

A Deep Differential Analysis in Four Subtypes of Breast Cancer Based on Regulations of miRNA-mRNA.

IET Syst Biol. 2025 Jan-Dec;19(1):e70020. doi: 10.1049/syb2.70020.

Sharp-SSL: Selective High-Dimensional Axis-Aligned Random Projections for Semi-Supervised Learning.

J Am Stat Assoc. 2024 Apr 12;120(549):395-407. doi: 10.1080/01621459.2024.2340792. eCollection 2025.

Multi-way overlapping clustering by Bayesian tensor decomposition.

Stat Interface. 2024;17(2):219-230. doi: 10.4310/23-sii790. Epub 2024 Feb 1.

Evaluation of agreement between common clustering strategies for DNA methylation-based subtyping of breast tumours.

Epigenomics. 2025 Feb;17(2):105-114. doi: 10.1080/17501911.2024.2441653. Epub 2024 Dec 23.

Principles of artificial intelligence in radiooncology.

Strahlenther Onkol. 2025 Mar;201(3):210-235. doi: 10.1007/s00066-024-02272-0. Epub 2024 Aug 6.

Methods in DNA methylation array dataset analysis: A review.

Comput Struct Biotechnol J. 2024 May 17;23:2304-2325. doi: 10.1016/j.csbj.2024.05.015. eCollection 2024 Dec.

Multi-Input data ASsembly for joint Analysis (MIASA): A framework for the joint analysis of disjoint sets of variables.

PLoS One. 2024 May 10;19(5):e0302425. doi: 10.1371/journal.pone.0302425. eCollection 2024.

Somtimes: self organizing maps for time series clustering and its application to serious illness conversations.

Data Min Knowl Discov. 2024;38(3):813-839. doi: 10.1007/s10618-023-00979-9. Epub 2023 Oct 20.

MOBILE pipeline enables identification of context-specific networks and regulatory mechanisms.

Nat Commun. 2023 Jul 6;14(1):3991. doi: 10.1038/s41467-023-39729-2.

References

A Study of the Comparability of External Criteria for Hierarchical Cluster Analysis.

Multivariate Behav Res. 1986 Oct 1;21(4):441-58. doi: 10.1207/s15327906mbr2104_5.

Diagnostic signatures from microarrays: a bioinformatics concept for personalized medicine.

Drug Discov Today. 2004 Dec 15;9(24 Suppl):S32-6.

A comparative study of different machine learning methods on microarray gene expression data.

BMC Genomics. 2008;9 Suppl 1(Suppl 1):S13. doi: 10.1186/1471-2164-9-S1-S13.

Techniques for clustering gene expression data.

Comput Biol Med. 2008 Mar;38(3):283-93. doi: 10.1016/j.compbiomed.2007.11.001. Epub 2007 Dec 3.

Evaluation of clustering algorithms for gene expression data.

BMC Bioinformatics. 2006 Dec 12;7 Suppl 4(Suppl 4):S17. doi: 10.1186/1471-2105-7-S4-S17.

Integrative molecular concept modeling of prostate cancer progression.

Nat Genet. 2007 Jan;39(1):41-51. doi: 10.1038/ng1935. Epub 2006 Dec 17.

Metric for measuring the effectiveness of clustering of DNA microarray expression.

BMC Bioinformatics. 2006 Sep 6;7 Suppl 2(Suppl 2):S5. doi: 10.1186/1471-2105-7-S2-S5.

NCBI GEO: mining tens of millions of expression profiles--database and tools update.

Nucleic Acids Res. 2007 Jan;35(Database issue):D760-5. doi: 10.1093/nar/gkl887. Epub 2006 Nov 11.

Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes.

BMC Bioinformatics. 2006 Aug 31;7:397. doi: 10.1186/1471-2105-7-397.

Serrated carcinomas form a subclass of colorectal cancer with distinct molecular basis.

Oncogene. 2007 Jan 11;26(2):312-20. doi: 10.1038/sj.onc.1209778. Epub 2006 Jul 3.
