Unsupervised Binning of Metagenomic Assembled Contigs Using Improved Fuzzy C-Means Method.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2017 Nov-Dec;14(6):1459-1467. doi: 10.1109/TCBB.2016.2576452. Epub 2016 Jun 7.

DOI:10.1109/TCBB.2016.2576452

Abstract

Metagenomic contig binning is a necessary step in metagenome analysis. After assembly, the number of contigs belonging to each genome is usually unequal, so a metagenomic contig dataset is imbalanced, and the traditional fuzzy c-means method (FCM) does not handle it well. In this paper, we introduce an improved version of the fuzzy c-means method (IFCM) for metagenomic contig binning. First, tetranucleotide frequencies are calculated for every contig. Second, the number of bins is roughly estimated from the genome-length distribution of a complete set of non-draft sequenced microbial genomes from NCBI. Then, IFCM is used to cluster the DNA contigs using the estimated number of bins. Finally, a clustering validity function is used to determine the binning result. We tested this method on one synthetic and two real datasets, and the experimental results show its effectiveness compared with other tools.
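The first and third steps can be illustrated with a minimal Python sketch. It computes a 256-dimensional tetranucleotide frequency vector per contig and then runs the traditional FCM baseline on those vectors. This is not the paper's IFCM, whose modifications the abstract does not describe. The function names, the farthest-point initialization, the fuzzifier `m = 2`, and the toy contigs are all assumptions made for illustration. The number of bins `c` is supplied by hand here rather than estimated from NCBI genome lengths, and no validity function is applied.

```python
from itertools import product

# All 256 tetranucleotides in a fixed lexicographic order (AAAA, AAAC, ...).
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def tetranucleotide_frequencies(contig):
    """Return a 256-dim vector of normalized 4-mer frequencies for one contig."""
    seq = contig.upper()
    counts = [0] * len(KMERS)
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])  # windows containing N etc. are skipped
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [n / total for n in counts] if total else [0.0] * len(KMERS)


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, c):
    """Deterministic initial centers (an assumption, chosen for reproducibility)."""
    centers = [list(points[0])]
    while len(centers) < c:
        nxt = max(points, key=lambda p: min(sq_dist(p, v) for v in centers))
        centers.append(list(nxt))
    return centers


def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-10):
    """Traditional FCM: alternate membership and center updates."""
    centers = farthest_point_init(points, c)
    dim = len(points[0])
    memberships = []
    for _ in range(iters):
        # Membership update: u_ik is proportional to (squared distance)^(-1/(m-1)).
        memberships = []
        for p in points:
            w = [max(sq_dist(p, v), 1e-12) ** (-1.0 / (m - 1.0)) for v in centers]
            s = sum(w)
            memberships.append([wk / s for wk in w])
        # Center update: mean of the points weighted by u_ik^m.
        new_centers = []
        for k in range(c):
            weights = [row[k] ** m for row in memberships]
            wsum = sum(weights)
            new_centers.append(
                [sum(wt * p[j] for wt, p in zip(weights, points)) / wsum
                 for j in range(dim)]
            )
        shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships


# Toy usage: two AT-rich and two GC-rich made-up contigs, binned with c = 2.
contigs = ["ATATATTATAATATTA" * 5, "ATTATATAATATATAT" * 5,
           "GCGCCGGCGGCCGCGC" * 5, "GGCCGCGCGCCGGCGC" * 5]
features = [tetranucleotide_frequencies(s) for s in contigs]
centers, memberships = fuzzy_c_means(features, c=2)
bins = [max(range(2), key=row.__getitem__) for row in memberships]
print(bins)
```

Each contig is assigned to the bin with its highest membership. In standard FCM every cluster contributes to the objective in proportion to its size, which is one reason a few large genomes can dominate many small ones in an imbalanced contig set.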
