Zou Quan, Lin Gang, Jiang Xingpeng, Liu Xiangrong, Zeng Xiangxiang
Tianjin University.
University of Electronic Science and Technology of China.
Brief Bioinform. 2020 Jan 17;21(1):1-10. doi: 10.1093/bib/bby090.
Sequence clustering is a basic bioinformatics task that is attracting renewed attention with the development of metagenomics and microbiomics. The latest sequencing techniques have lowered costs and, as a result, massive amounts of DNA/RNA sequences are being produced. The challenge is to cluster the sequence data using stable, quick and accurate methods. For microbiome sequencing data, 16S ribosomal RNA operational taxonomic units are typically used. However, there is often a gap between algorithm developers and bioinformatics users. Different software tools can produce diverse results, and users can find these results difficult to analyze. Understanding the different clustering mechanisms is crucial to understanding the results that they produce. In this review, we selected several popular clustering tools, briefly explained their key computing principles, analyzed their characteristics and compared them using two independent benchmark datasets. Our aim is to assist bioinformatics users in employing suitable clustering tools effectively to analyze big sequencing data. Related data, code and software tools are accessible at http://lab.malab.cn/~lg/clustering/.