Department of Computer Science and Engineering, University of Moratuwa, Bandaranayake Mawatha, Moratuwa 10400, Sri Lanka.
School of Computing, Australian National University, Canberra ACT 2600, Australia; Flinders Accelerator for Microbiome Exploration, Flinders University, Bedford Park SA 5042, Australia.
Comput Biol Chem. 2022 Oct;100:107734. doi: 10.1016/j.compbiolchem.2022.107734. Epub 2022 Jul 14.
Metagenomics has enabled culture-independent analysis of micro-organisms present in environmental samples. Metagenomic binning, which groups contigs into bins that represent different taxonomic groups, is an important step that follows assembly in a typical metagenomic workflow. Most metagenomic binning tools represent the composition and coverage information of contigs as feature vectors with a large number of dimensions. However, these tools rely on traditional Euclidean or Manhattan distance metrics, which become unreliable in high-dimensional spaces. We propose CH-Bin, a binning approach that uses convex hull distance to bin contigs represented by high-dimensional feature vectors. Using experimental evidence from simulated and real datasets, we demonstrate that representing contigs with high-dimensional feature vectors can preserve additional information and improve binning results. We further demonstrate that the convex hull distance-based binning approach can be used effectively to bin such high-dimensional data. To the best of our knowledge, this is the first time that composition information from oligonucleotides of multiple sizes has been used to represent the composition of contigs, and the first time that a convex hull distance-based binning algorithm has been used to bin metagenomic contigs. The source code of CH-Bin is available at https://github.com/kdsuneraavinash/CH-Bin.
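The two ideas named in the abstract, a composition vector built from oligonucleotides of several sizes and a point-to-convex-hull distance, can be illustrated with a minimal Python sketch. This is not CH-Bin's implementation: the function names `composition_vector` and `convex_hull_distance`, the choice of k-mer sizes, the omission of reverse-complement merging and coverage features, and the use of a Frank-Wolfe iteration to approximate the hull distance are all assumptions made for illustration.

```python
from itertools import product
from math import sqrt


def composition_vector(sequence, ks=(3, 4)):
    """Concatenate normalized k-mer frequencies for every k in `ks`.

    Hypothetical helper: the k values are illustrative, not CH-Bin's settings.
    """
    sequence = sequence.upper()
    features = []
    for k in ks:
        kmers = ["".join(p) for p in product("ACGT", repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            if kmer in counts:  # skip windows containing N or other symbols
                counts[kmer] += 1
        total = sum(counts.values()) or 1  # avoid division by zero
        features.extend(counts[kmer] / total for kmer in kmers)
    return features


def convex_hull_distance(x, points, max_iter=1000, tol=1e-12):
    """Approximate Euclidean distance from `x` to the convex hull of `points`.

    Uses a Frank-Wolfe iteration with exact line search (an assumed solver,
    chosen here because it needs no external libraries).
    """
    y = list(points[0])  # current estimate of the nearest hull point
    for _ in range(max_iter):
        r = [xi - yi for xi, yi in zip(x, y)]  # residual pointing from y to x
        # Pick the hull vertex best aligned with the residual direction.
        s = max(points, key=lambda p: sum(ri * pi for ri, pi in zip(r, p)))
        d = [si - yi for si, yi in zip(s, y)]
        gap = sum(ri * di for ri, di in zip(r, d))  # Frank-Wolfe gap
        norm_sq = sum(di * di for di in d)
        if gap <= tol or norm_sq == 0.0:
            break  # no vertex improves the estimate any further
        t = min(1.0, gap / norm_sq)  # exact line search on the segment y -> s
        y = [yi + t * di for yi, di in zip(y, d)]
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
```

With `ks=(3, 4)` the composition vector has 64 + 256 = 320 dimensions. For the triangle with vertices (0, 0), (1, 0) and (0, 1), `convex_hull_distance` places the point (1, 1) at distance sqrt(0.5), roughly 0.707, from the hull, while points inside the triangle get a distance close to zero. One plausible use, again an assumption rather than the published procedure, is to assign each contig to the bin whose current members span the nearest hull.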