Center for Research in Biological Systems, University of California San Diego, USA.
Brief Bioinform. 2012 Nov;13(6):656-68. doi: 10.1093/bib/bbs035. Epub 2012 Jul 6.
Rapid advances in high-throughput sequencing technologies have dramatically accelerated metagenomic studies of microbial communities in diverse environments. Fundamental questions in metagenomics include the identities, composition and dynamics of microbial populations, as well as their functions and interactions. However, the massive volume and complexity of these sequence data pose tremendous challenges for data analysis. These challenges include, but are not limited to, ever-increasing computational demand, biased sequence sampling, sequence errors, sequence artifacts and novel sequences. Sequence clustering methods can directly address many of the fundamental questions by grouping similar sequences into families. Clustering analysis also addresses the challenges in metagenomics: a large redundant data set can be represented by a small non-redundant set, in which each cluster is represented by a single entry or a consensus; artifacts can be rapidly detected through clustering; and errors can be identified, filtered or corrected using the consensus of sequences within clusters.
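The idea of reducing a redundant read set to cluster representatives or consensus sequences can be illustrated with a minimal sketch. It assumes a toy ungapped, position-wise similarity and a greedy incremental strategy; the function names (`identity`, `greedy_cluster`, `consensus`), the 0.9 threshold and the example reads are all hypothetical choices made for illustration, not the method of any particular tool discussed in the article.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Toy similarity (assumption): matching positions divided by the
    longer length. No alignment and no gaps, unlike real tools."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, threshold=0.9):
    """Greedy incremental clustering: process sequences longest first;
    each one joins the first cluster whose representative is similar
    enough, otherwise it founds a new cluster."""
    clusters = []  # clusters[i][0] is the representative of cluster i
    for s in sorted(seqs, key=len, reverse=True):
        for members in clusters:
            if identity(members[0], s) >= threshold:
                members.append(s)
                break
        else:
            clusters.append([s])
    return clusters


def consensus(members):
    """Column-wise majority vote over the cluster members, using the
    representative's length; shorter members skip missing columns."""
    columns = []
    for i in range(len(members[0])):
        column = [m[i] for m in members if i < len(m)]
        columns.append(Counter(column).most_common(1)[0][0])
    return "".join(columns)


# Hypothetical reads: two families, one exact duplicate, two 1-base variants.
reads = [
    "ACGTACGTAC",
    "ACGTACGTAA",
    "ACGTACGTAC",
    "TTTTGGGGCC",
    "TTTTGGGGCA",
]

clusters = greedy_cluster(reads, threshold=0.9)
representatives = [members[0] for members in clusters]  # non-redundant set
print(len(reads), "reads reduced to", len(representatives), "representatives")
for members in clusters:
    print(members[0], "size:", len(members), "consensus:", consensus(members))
```

In this sketch the representatives form the small non-redundant set, and the majority-vote consensus shows how a base that disagrees with most cluster members could be flagged or corrected. A practical implementation would replace the toy similarity with an alignment-based or k-mer-based measure.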