Avoiding common pitfalls when clustering biological data.

Author information

Ronan Tom, Qi Zhijie, Naegle Kristen M

Affiliation

Department of Biomedical Engineering, Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA.

Publication information

Sci Signal. 2016 Jun 14;9(432):re6. doi: 10.1126/scisignal.aad1932.

DOI:10.1126/scisignal.aad1932

PMID:27303057

Abstract

Clustering is an unsupervised learning method, which groups data points based on similarity, and is used to reveal the underlying structure of data. This computational approach is essential to understanding and visualizing the complex data that are acquired in high-throughput multidimensional biological experiments. Clustering enables researchers to make biological inferences for further experiments. Although a powerful technique, inappropriate application can lead biological researchers to waste resources and time in experimental follow-up. We review common pitfalls identified from the published molecular biology literature and present methods to avoid them. Commonly encountered pitfalls relate to the high-dimensional nature of biological data from high-throughput experiments, the failure to consider more than one clustering method for a given problem, and the difficulty in determining whether clustering has produced meaningful results. We present concrete examples of problems and solutions (clustering results) in the form of toy problems and real biological data for these issues. We also discuss ensemble clustering as an easy-to-implement method that enables the exploration of multiple clustering solutions and improves robustness of clustering solutions. Increased awareness of common clustering pitfalls will help researchers avoid overinterpreting or misinterpreting the results and missing valuable insights when clustering biological data.
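As a rough illustration of the ensemble clustering idea mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a co-association approach: run a simple k-means several times with different cluster numbers and random starts, then record how often each pair of points lands in the same cluster. The function names (`kmeans_labels`, `co_association`), the toy data, and all parameter values are hypothetical choices made for this example.

```python
import numpy as np


def kmeans_labels(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means (illustrative only); returns one label per row of X."""
    # Start from k distinct data points chosen at random (fancy indexing copies them).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center, shape (n, k).
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels


def co_association(X, ks, n_runs, seed=0):
    """Fraction of clustering runs in which each pair of points shares a cluster."""
    rng = np.random.default_rng(seed)
    n = len(X)
    co = np.zeros((n, n))
    total = 0
    for k in ks:
        for _ in range(n_runs):
            labels = kmeans_labels(X, k, rng)
            # Boolean n-by-n matrix: True where two points got the same label.
            co += labels[:, None] == labels[None, :]
            total += 1
    return co / total


# Toy problem (hypothetical data): two well-separated groups of 20 points each.
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal(0.0, 0.3, size=(20, 2)),
    data_rng.normal(5.0, 0.3, size=(20, 2)),
])

co = co_association(X, ks=(2, 3, 4), n_runs=10)
within = co[:20, :20].mean()   # average co-clustering inside the first group
between = co[:20, 20:].mean()  # average co-clustering across the two groups
print(f"within-group: {within:.2f}, between-group: {between:.2f}")
```

Pairs with a high co-association value cluster together regardless of the chosen cluster number or starting point, which is one way to judge whether a grouping is robust. The co-association matrix could also serve as a similarity matrix for a final consensus clustering step, which is not shown here.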

Similar articles

Translational Metabolomics of Head Injury: Exploring Dysfunctional Cerebral Metabolism with Ex Vivo NMR Spectroscopy-Based Metabolite Quantification

The BioPrompt-box: an ontology-based clustering tool for searching in biological databases.

BMC Bioinformatics. 2007 Mar 8;8 Suppl 1(Suppl 1):S8. doi: 10.1186/1471-2105-8-S1-S8.

Unsupervised Structure Detection in Biomedical Data.

IEEE/ACM Trans Comput Biol Bioinform. 2015 Jul-Aug;12(4):753-60. doi: 10.1109/TCBB.2015.2394408.

Accounting for noise when clustering biological data.

Brief Bioinform. 2013 Jul;14(4):423-36. doi: 10.1093/bib/bbs057. Epub 2012 Oct 14.

Fuzzy ensemble clustering based on random projections for DNA microarray data analysis.

Artif Intell Med. 2009 Feb-Mar;45(2-3):173-83. doi: 10.1016/j.artmed.2008.07.014. Epub 2008 Sep 17.

Multi-view spectral clustering and its chemical application.

Int J Comput Biol Drug Des. 2013;6(1-2):32-49. doi: 10.1504/IJCBDD.2013.052200. Epub 2013 Feb 21.

Efficient clustering aggregation based on data fragments.

IEEE Trans Syst Man Cybern B Cybern. 2012 Jun;42(3):913-26. doi: 10.1109/TSMCB.2012.2183591. Epub 2012 Feb 10.

A novel pathway-based distance score enhances assessment of disease heterogeneity in gene expression.

BMC Bioinformatics. 2017 Jun 20;18(1):309. doi: 10.1186/s12859-017-1727-4.

Combined Mapping of Multiple clUsteriNg ALgorithms (COMMUNAL): A Robust Method for Selection of Cluster Number, K.

Sci Rep. 2015 Nov 19;5:16971. doi: 10.1038/srep16971.

Cited by

An INS-1 832/13 β-Cell Proteome Highlights the Rapid Regulation of Fatty Acid Biosynthesis in Glucose-Stimulated Insulin Secretion.

Proteomics. 2025 Aug;25(15):13-26. doi: 10.1002/pmic.70005. Epub 2025 Jul 20.

Pain in the brain: Psychological correlates of chronic pain and fibromyalgia.

PLoS One. 2025 Jun 11;20(6):e0324457. doi: 10.1371/journal.pone.0324457. eCollection 2025.

Clustering change patterns among learners of an online Recovery College in Quebec.

Front Psychiatry. 2025 May 27;16:1534349. doi: 10.3389/fpsyt.2025.1534349. eCollection 2025.

Why Has Biomarker-Guided Fluid Resuscitation for Sepsis Not Been Implemented in Clinical Practice?

Crit Care Explor. 2025 Jun 9;7(6):e1274. doi: 10.1097/CCE.0000000000001274. eCollection 2025 Jun 1.

A biological model of nonlinear dimensionality reduction.

Sci Adv. 2025 Feb 7;11(6):eadp9048. doi: 10.1126/sciadv.adp9048. Epub 2025 Feb 5.

Clustering affordable care act qualified health plans to understand how and where insurance facilitates or impedes access to HIV prevention.

AIDS Res Ther. 2024 Nov 19;21(1):83. doi: 10.1186/s12981-024-00674-9.

AAclust: -optimized clustering for selecting redundancy-reduced sets of amino acid scales.

Bioinform Adv. 2024 Oct 30;4(1):vbae165. doi: 10.1093/bioadv/vbae165. eCollection 2024.

Long Non-Coding RNAs, Nuclear Receptors and Their Cross-Talks in Cancer-Implications and Perspectives.

Cancers (Basel). 2024 Aug 22;16(16):2920. doi: 10.3390/cancers16162920.

Discovery and description of novel phage genomes from urban microbiomes sampled by the MetaSUB consortium.

Sci Rep. 2024 Apr 4;14(1):7913. doi: 10.1038/s41598-024-58226-0.

Subphenotypes in critical illness: a priori biological rationale is key.

Intensive Care Med. 2024 Feb;50(2):299-301. doi: 10.1007/s00134-023-07273-8. Epub 2023 Nov 28.
