Ronan Tom, Qi Zhijie, Naegle Kristen M
Department of Biomedical Engineering, Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA.
Sci Signal. 2016 Jun 14;9(432):re6. doi: 10.1126/scisignal.aad1932.
Clustering is an unsupervised learning method that groups data points based on similarity and is used to reveal the underlying structure of data. This computational approach is essential to understanding and visualizing the complex data that are acquired in high-throughput multidimensional biological experiments. Clustering enables researchers to make biological inferences for further experiments. Although clustering is a powerful technique, its inappropriate application can lead biological researchers to waste resources and time in experimental follow-up. We review common pitfalls identified from the published molecular biology literature and present methods to avoid them. Commonly encountered pitfalls relate to the high-dimensional nature of biological data from high-throughput experiments, the failure to consider more than one clustering method for a given problem, and the difficulty in determining whether clustering has produced meaningful results. For each of these issues, we present concrete examples of problems and solutions (clustering results) using toy problems and real biological data. We also discuss ensemble clustering as an easy-to-implement method that enables the exploration of multiple clustering solutions and improves the robustness of clustering solutions. Increased awareness of common clustering pitfalls will help researchers avoid overinterpreting or misinterpreting the results and missing valuable insights when clustering biological data.
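To make the idea of ensemble clustering concrete, the following is a minimal sketch and not the authors' implementation. It assumes one common formulation: run several clustering configurations, record how often each pair of points lands in the same cluster (a co-association matrix), and link pairs that co-cluster in at least half of the runs. The toy points, the function names (`agglomerative`, `co_association`, `consensus_clusters`), the choice of linkages, and the 0.5 threshold are all illustrative assumptions.

```python
from itertools import combinations


def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


# How a cluster-to-cluster distance is derived from pairwise point distances.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}


def agglomerative(points, k, linkage):
    """Naive agglomerative clustering: merge the two closest clusters until k remain."""
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: combine(
                [dist(points[i], points[j])
                 for i in clusters[ab[0]] for j in clusters[ab[1]]]
            ),
        )
        merged = clusters.pop(b)  # a < b, so index a is unaffected by the pop
        clusters[a].extend(merged)
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def co_association(labelings):
    """Fraction of labelings in which each pair of points shares a cluster."""
    n = len(labelings[0])
    return [
        [sum(lab[i] == lab[j] for lab in labelings) / len(labelings)
         for j in range(n)]
        for i in range(n)
    ]


def consensus_clusters(coassoc, threshold=0.5):
    """Connected components of the graph linking pairs with co-association >= threshold."""
    n = len(coassoc)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and coassoc[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical toy data: two well-separated groups of five 2-D points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.4),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.4, 10.5)]

# Ensemble members: three linkage rules, each cut at k = 2 and k = 3.
labelings = [agglomerative(points, k, linkage)
             for linkage in LINKAGES for k in (2, 3)]
coassoc = co_association(labelings)
consensus = consensus_clusters(coassoc)
print(consensus)
```

On this toy data the individual solutions disagree about how to split a group when asked for three clusters, yet the consensus recovers the two well-separated groups. Real high-dimensional biological data would likely need an optimized library and more careful choices of distance metric, ensemble members, and threshold.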