COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment and paired-end read LinkAge.

Author information

Lu Yang Young, Chen Ting, Fuhrman Jed A, Sun Fengzhu

Affiliations

Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA.

Center for Synthetic and Systems Biology, TNLIST, Beijing, China.

Publication information

Bioinformatics. 2017 Mar 15;33(6):791-798. doi: 10.1093/bioinformatics/btw290.

DOI:10.1093/bioinformatics/btw290

PMID:27256312

Abstract

MOTIVATION

The advent of next-generation sequencing technologies enables researchers to sequence complex microbial communities directly from the environment. Because assembly typically produces only genome fragments, known as contigs, rather than entire genomes, it is crucial to group contigs into operational taxonomic units (OTUs) for further taxonomic profiling and downstream functional analysis. OTU clustering is also referred to as binning. We present COCACOLA, a general framework that automatically bins contigs into OTUs based on sequence composition and coverage across multiple samples.

RESULTS

The effectiveness of COCACOLA is demonstrated on both simulated and real datasets in comparison with state-of-the-art binning approaches such as CONCOCT, GroopM, MaxBin and MetaBAT. The superior performance of COCACOLA relies on two aspects. The first is the use of L1 distance instead of Euclidean distance for better taxonomic identification during initialization. The second, and more important, is that COCACOLA takes advantage of both hard clustering and soft clustering through sparsity regularization. In addition, the COCACOLA framework seamlessly incorporates customized knowledge to improve binning accuracy. In our study, we investigated two types of additional knowledge, the co-alignment of contigs to reference genomes and the linkage of contigs provided by paired-end reads, as well as the ensemble of both. We find that both co-alignment and linkage information further improve binning in the majority of cases. COCACOLA is scalable and faster than CONCOCT, GroopM, MaxBin and MetaBAT.
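
To make the distinction between L1 and Euclidean distance concrete, the following is a minimal sketch, not the COCACOLA implementation. It assumes each contig is summarized as a non-negative feature vector (for example, k-mer counts or per-sample coverage) that is normalized to sum to 1. The function names `normalize`, `l1_distance`, `euclidean_distance` and `nearest` are hypothetical and chosen only for illustration.

```python
from math import sqrt


def normalize(counts):
    """Scale non-negative counts so they sum to 1 (a frequency profile)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("profile has no counts")
    return [c / total for c in counts]


def l1_distance(p, q):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest(query, centroids, distance):
    """Index of the centroid closest to `query` under `distance`."""
    return min(range(len(centroids)), key=lambda i: distance(query, centroids[i]))


# Toy profiles (illustrative values only, not data from the paper).
contig = normalize([1, 1])
centroids = [normalize([2, 0]), normalize([2, 3])]

# Assign the contig to a centroid under each metric.
bin_l1 = nearest(contig, centroids, l1_distance)
bin_l2 = nearest(contig, centroids, euclidean_distance)
```

Squaring makes Euclidean distance weight a few large coordinate differences more heavily than many small ones, whereas L1 distance weights all differences linearly. Which metric serves a given dataset better is an empirical question; the abstract reports that L1 worked better for initialization in the authors' experiments.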

AVAILABILITY AND IMPLEMENTATION

The software is available at https://github.com/younglululu/COCACOLA.

CONTACT

fsun@usc.edu.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
