MetaCC allows scalable and integrative analyses of both long-read and short-read metagenomic Hi-C data.

Affiliation

Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.

Publication information

Nat Commun. 2023 Oct 6;14(1):6231. doi: 10.1038/s41467-023-41209-6.

DOI:10.1038/s41467-023-41209-6

PMID:37802989

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10558524/

Abstract

Metagenomic Hi-C (metaHi-C) can identify contig-to-contig relationships based on their proximity within the same physical cell. Shotgun libraries in metaHi-C experiments can be constructed by next-generation sequencing (short-read metaHi-C) or by more recent third-generation sequencing (long-read metaHi-C). However, all existing metaHi-C analysis methods were developed and benchmarked on short-read metaHi-C datasets, and there remains much room for improvement toward more scalable and stable analyses, especially for long-read metaHi-C data. Here we report MetaCC, an efficient and integrative framework for analyzing both short-read and long-read metaHi-C datasets. MetaCC outperforms existing methods on normalization and binning. In particular, the MetaCC normalization module, named NormCC, is more than 3000 times faster than the current state-of-the-art method HiCzin on a complex wastewater dataset. When applied to a sheep gut long-read metaHi-C dataset, the MetaCC binning module retrieves 709 high-quality genomes with the largest species diversity from a single sample, including an expansion of five uncultured members of the order Erysipelotrichales, and is the only binner that recovers the genome of an important species, Bacteroides vulgatus. Further plasmid analyses reveal that MetaCC binning is able to capture multi-copy plasmids.
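To make the notion of contig-to-contig relationships concrete, the following is a minimal sketch of how Hi-C read pairs might be tallied into a raw contact map between contigs. It is an illustration only, not MetaCC's or NormCC's implementation: the function name `build_contact_map`, the input layout (one tuple of contig names per read pair), and the toy contig names are all assumptions made for this example, and no normalization or binning is attempted.

```python
from collections import Counter


def build_contact_map(read_pairs):
    """Count Hi-C read pairs whose two mates map to different contigs.

    read_pairs: iterable of (contig_of_mate1, contig_of_mate2) tuples
                (hypothetical input layout, assumed for this sketch).
    Returns a Counter keyed by an order-independent contig pair.
    """
    contacts = Counter()
    for contig_a, contig_b in read_pairs:
        if contig_a == contig_b:
            # Intra-contig pairs carry no contig-to-contig signal.
            continue
        # Sort the names so (c1, c2) and (c2, c1) share one key.
        contacts[tuple(sorted((contig_a, contig_b)))] += 1
    return contacts


# Toy input with made-up contig names, not data from the paper.
pairs = [("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c3", "c4")]
contact_map = build_contact_map(pairs)
```

Raw counts of this kind are typically biased by factors such as contig length and coverage, which is why a normalization step precedes binning in metaHi-C pipelines.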
