A systematic comparison of genome-scale clustering algorithms.

Affiliation

The Jackson Laboratory, Bar Harbor, ME 04609, USA.

Publication information

BMC Bioinformatics. 2012 Jun 25;13 Suppl 10(Suppl 10):S7. doi: 10.1186/1471-2105-13-S10-S7.

DOI:10.1186/1471-2105-13-S10-S7

PMID:22759431

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3382433/

Abstract

BACKGROUND

A wealth of clustering algorithms has been applied to gene co-expression experiments. These algorithms cover a broad range of approaches, from conventional techniques such as k-means and hierarchical clustering to graphical approaches such as k-clique communities, weighted gene co-expression networks (WGCNA) and paraclique. Comparing these methods to evaluate their relative effectiveness provides guidance for algorithm selection, development and implementation. Most prior work on comparative clustering evaluation has focused on parametric methods. Graph-theoretical methods are recent additions to the tool set for the global analysis and decomposition of microarray co-expression matrices, and they have not generally been included in earlier methodological comparisons. In the present study, a variety of parametric and graph-theoretical clustering algorithms are compared using well-characterized transcriptomic data at a genome scale from Saccharomyces cerevisiae.

METHODS

For each clustering method under study, a variety of parameters were tested. Jaccard similarity was used to measure each cluster's agreement with every GO and KEGG annotation set, and the highest Jaccard score was assigned to the cluster. Clusters were grouped into small, medium, and large bins, and the Jaccard scores of the five top-scoring clusters in each bin were averaged and reported as the best average top 5 (BAT5) score for the particular method.
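
The scoring described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the toy gene identifiers, and in particular the size cut-offs in SIZE_BINS are assumptions, since the abstract does not state the bin boundaries. It scores a single clustering run; taking the best such average across the parameter settings tested for a method is one plausible reading of "best" in BAT5.

```python
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two gene sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_annotation_score(cluster, annotation_sets):
    """Highest Jaccard score of a cluster against any annotation set."""
    return max(
        (jaccard(cluster, genes) for genes in annotation_sets.values()),
        default=0.0,
    )


# Hypothetical size bins (inclusive gene counts); the abstract gives no cut-offs.
SIZE_BINS = {"small": (3, 10), "medium": (11, 100), "large": (101, 1000)}


def top5_average_by_bin(clusters, annotation_sets, bins=SIZE_BINS, top_n=5):
    """Average the top_n best-annotation scores within each size bin."""
    scores = {name: [] for name in bins}
    for cluster in clusters:
        size = len(set(cluster))
        for name, (low, high) in bins.items():
            if low <= size <= high:
                scores[name].append(best_annotation_score(cluster, annotation_sets))
                break  # each cluster belongs to at most one bin
    return {
        name: mean(sorted(vals, reverse=True)[:top_n]) if vals else None
        for name, vals in scores.items()
    }


# Toy annotation sets and clusters (made-up gene identifiers).
annotations = {
    "GO:example": {"g1", "g2", "g3", "g4"},
    "KEGG:example": {"g5", "g6", "g7"},
}
clusters = [{"g1", "g2", "g3"}, {"g5", "g6", "g7"}, {"g1", "g5", "g8"}]
result = top5_average_by_bin(clusters, annotations)
print(result)
```

Bins that receive no clusters are reported as None rather than 0, so that an empty bin is not mistaken for a poorly scoring one.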

RESULTS

Clusters produced by each method were evaluated by how well they matched known pathways, which yields a readily interpretable ranking of the methods' relative effectiveness at clustering the genes. Methods were also tested to determine whether they identified clusters consistent with those found by other clustering methods.

CONCLUSIONS

Validation of clusters against known gene classifications demonstrates that, for these data, graph-based techniques outperform conventional clustering approaches, suggesting that further development and application of combinatorial strategies is warranted.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b0b/3382433/b0808420e923/1471-2105-13-S10-S7-1.jpg

Similar articles

Comparisons of graph-structure clustering methods for gene expression data.

Acta Biochim Biophys Sin (Shanghai). 2006 Jun;38(6):379-84. doi: 10.1111/j.1745-7270.2006.00175.x.

Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments.

BMC Genomics. 2010 Jan 7;11:15. doi: 10.1186/1471-2164-11-15.

Clustering of gene expression data: performance and similarity analysis.

BMC Bioinformatics. 2006 Dec 12;7 Suppl 4(Suppl 4):S19. doi: 10.1186/1471-2105-7-S4-S19.

From co-expression to co-regulation: how many microarray experiments do we need?

Genome Biol. 2004;5(7):R48. doi: 10.1186/gb-2004-5-7-r48. Epub 2004 Jun 28.

Metric for measuring the effectiveness of clustering of DNA microarray expression.

BMC Bioinformatics. 2006 Sep 6;7 Suppl 2(Suppl 2):S5. doi: 10.1186/1471-2105-7-S2-S5.

Genome-scale cluster analysis of replicated microarrays using shrinkage correlation coefficient.

BMC Bioinformatics. 2008 Jun 18;9:288. doi: 10.1186/1471-2105-9-288.

Analysis of a Gibbs sampler method for model-based clustering of gene expression data.

Bioinformatics. 2008 Jan 15;24(2):176-83. doi: 10.1093/bioinformatics/btm562. Epub 2007 Nov 22.

Clustering gene expression data using graph separators.

In Silico Biol. 2007;7(4-5):433-52.

Dynamically weighted clustering with noise set.

Bioinformatics. 2010 Feb 1;26(3):341-7. doi: 10.1093/bioinformatics/btp671. Epub 2009 Dec 9.

Cited by

A Comparative Study of Gene Co-Expression Thresholding Algorithms.

J Comput Biol. 2024 Jun;31(6):539-548. doi: 10.1089/cmb.2024.0509. Epub 2024 May 23.

Association of whole-person eigen-polygenic risk scores with Alzheimer's disease.

Hum Mol Genet. 2024 Jul 22;33(15):1315-1327. doi: 10.1093/hmg/ddae067.

Seminar: Scalable Preprocessing Tools for Exposomic Data Analysis.

Environ Health Perspect. 2023 Dec;131(12):124201. doi: 10.1289/EHP12901. Epub 2023 Dec 18.

Machine Learning Prediction of Adenovirus D8 Conjunctivitis Complications from Viral Whole-Genome Sequence.

Ophthalmol Sci. 2022 May 10;2(4):100166. doi: 10.1016/j.xops.2022.100166. eCollection 2022 Dec.

Molecular Subtyping and Outlier Detection in Human Disease Using the Paraclique Algorithm.

Algorithms. 2021 Feb;14(2). doi: 10.3390/a14020063. Epub 2021 Feb 19.

Machine learning in postgenomic biology and personalized medicine.

Wiley Interdiscip Rev Data Min Knowl Discov. 2022 Mar-Apr;12(2). doi: 10.1002/widm.1451. Epub 2022 Jan 24.

A Clinical Investigation on the Theragnostic Effect of MicroRNA Biomarkers for Survival Outcome in Cervical Cancer: A PRISMA-P Compliant Protocol for Systematic Review and Comprehensive Meta-Analysis.

Genes (Basel). 2022 Mar 5;13(3):463. doi: 10.3390/genes13030463.

Molecular Investigation of miRNA Biomarkers as Chemoresistance Regulators in Melanoma: A Protocol for Systematic Review and Meta-Analysis.

Genes (Basel). 2022 Jan 8;13(1):115. doi: 10.3390/genes13010115.

Genomic Metrics Applied to (): Species Reclassification, Identification of Unauthentic Genomes and False Type Strains.

Front Microbiol. 2021 Mar 25;12:614957. doi: 10.3389/fmicb.2021.614957. eCollection 2021.

Genetic Diversity Among Subspecies Revealed by Analysis of Complete Genome Sequences.

Front Microbiol. 2020 Aug 7;11:1701. doi: 10.3389/fmicb.2020.01701. eCollection 2020.
