Estimating the total genome length of a metagenomic sample using k-mers.

Affiliations

MOE Key Laboratory of Bioinformatics Division and Center for Synthetic & System Biology, BNRIST, Beijing, 100084, China.

Department of Automation, Tsinghua University, Beijing, 100084, China.

Publication information

BMC Genomics. 2019 Apr 4;20(Suppl 2):183. doi: 10.1186/s12864-019-5467-x.

DOI:10.1186/s12864-019-5467-x

PMID:30967110

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6456951/

Abstract

BACKGROUND

Metagenomic sequencing is a powerful technology for studying mixtures of microbes, or microbiomes, on humans and in the environment. One basic task in analyzing metagenomic data is to identify the component genomes in the community. This task is challenging because microbiome composition is complex, few reference genomes are known, and sequencing coverage is usually insufficient.

RESULTS

As an initial step toward understanding the complete composition of a metagenomic sample, we studied the problem of estimating the total length of all distinct component genomes in a metagenomic sample. We showed that this problem can be solved by estimating the total number of distinct k-mers in all the metagenomic sequencing data. We proposed a method for this estimation based on the sequencing coverage distribution of observed k-mers, and introduced a k-mer redundancy index (KRI) to fill in the gap between the count of distinct k-mers and the total genome length. We showed the effectiveness of the proposed method on a set of carefully designed simulation data corresponding to multiple situations of true metagenomic data. Results on real data indicate that the uncaptured genomic information can vary dramatically across metagenomic samples, with the potential to mislead downstream analyses.

CONCLUSIONS

We posed the question of what the total genome length of all distinct species in a microbial community is, and introduced a method to answer it.
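
The idea in the Results of estimating the total number of distinct k-mers from the coverage distribution of observed k-mers can be illustrated with a minimal sketch. This is not the authors' estimator: the abstract does not specify it, so the sketch substitutes the classical Chao1 richness estimator, which also works from the counts of k-mers seen once and twice. The function names (`kmer_counts`, `coverage_histogram`, `chao1_distinct`) are hypothetical, and the conversion from distinct k-mers to genome length via the KRI is not shown because its definition is not given here.

```python
from collections import Counter

# Translation table for reverse-complementing DNA.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmer_counts(reads, k):
    """Count canonical k-mers over all reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def coverage_histogram(counts):
    """Map each coverage value c to the number of distinct k-mers observed c times."""
    return Counter(counts.values())


def chao1_distinct(hist):
    """Chao1 estimate of the total number of distinct k-mers (assumed stand-in,
    not the method of the paper). Adds an estimate of unobserved k-mers,
    based on singletons (f1) and doubletons (f2), to the observed count."""
    observed = sum(hist.values())
    f1 = hist.get(1, 0)
    f2 = hist.get(2, 0)
    if f2 > 0:
        unseen = f1 * f1 / (2 * f2)
    else:
        # Bias-corrected form used when no k-mer was seen exactly twice.
        unseen = f1 * (f1 - 1) / 2
    return observed + unseen


if __name__ == "__main__":
    # Toy reads for illustration only; real data would come from FASTQ files.
    reads = ["ACGTACGGTCA", "CGTACGGTCAT", "TTGACCGTACG"]
    hist = coverage_histogram(kmer_counts(reads, k=5))
    print("coverage histogram:", dict(hist))
    print("estimated distinct k-mers:", chao1_distinct(hist))
```

When many k-mers are seen only once, the estimate rises well above the observed count, which mirrors the abstract's point that low-coverage samples can leave a large share of genomic information uncaptured.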
