Department of Energy Joint Genome Institute (DOE-JGI), Genome Biology Program, 2800 Mitchell Drive, Walnut Creek, CA 94598, USA.
Bioinformatics. 2010 Feb 1;26(3):295-301. doi: 10.1093/bioinformatics/btp687. Epub 2009 Dec 14.
Shotgun sequencing generates large numbers of short DNA reads, either from an isolated organism or, in metagenomics projects, from the aggregate genome of a microbial community. These reads are then assembled, based on overlapping sequences, into larger contiguous sequences (contigs). The feasibility of assembly and the coverage achieved (reads per nucleotide or distinct sequence of nucleotides) depend on several factors: the number of reads sequenced, the read length and the relative abundances of their source genomes in the microbial community. Low coverage suggests that most of the genomic DNA in the sample has not been sequenced, but it is often difficult to estimate either the extent of the uncaptured diversity or the amount of additional sequencing that would be most effective. In this work, we regard a metagenome as a population of DNA fragments (bins), each of which may be covered by one or more reads. We employ a gamma distribution to model this bin population because of its flexibility and ease of use. When a gamma approximation can be found that adequately fits the data, we can estimate the number of bins that were not sequenced and that could potentially be revealed by additional sequencing. We evaluate the performance of this model using simulated metagenomes and demonstrate its applicability on three recent metagenomic datasets.
Supplementary data are available at Bioinformatics online.
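To illustrate how a fitted gamma approximation could yield an estimate of unsequenced bins, the following minimal sketch assumes that the number of reads covering a bin is Poisson distributed with a gamma-distributed rate. The Poisson assumption, the function names and the parameter names (`shape`, `scale`, `effort`) are illustrative and are not taken from the paper; the gamma parameters are assumed to have been fitted already. Under these assumptions, the fraction of bins receiving no reads is (1 + effort × scale)^(−shape), where `effort` is the sequencing effort relative to the current run.

```python
def zero_fraction(shape, scale, effort=1.0):
    """Probability that a bin receives no reads.

    Assumes reads per bin ~ Poisson(rate), rate ~ Gamma(shape, scale);
    'effort' rescales the rate (1.0 = current sequencing effort).
    """
    return (1.0 + effort * scale) ** (-shape)


def estimate_total_bins(observed_bins, shape, scale):
    """Total bins (seen + unseen) implied by the bins observed so far."""
    return observed_bins / (1.0 - zero_fraction(shape, scale))


def expected_observed_bins(total_bins, shape, scale, effort):
    """Expected number of bins covered by at least one read at a given effort."""
    return total_bins * (1.0 - zero_fraction(shape, scale, effort))


# Hypothetical example values, not results from the paper.
observed = 500
shape, scale = 1.0, 1.0
total = estimate_total_bins(observed, shape, scale)
unseen = total - observed
after_tripling = expected_observed_bins(total, shape, scale, effort=3.0)
```

With these hypothetical parameters, half of the bins are expected to be unseen at the current effort, and tripling the sequencing effort would be expected to reveal half of those. Fitting `shape` and `scale` to observed (zero-truncated) coverage counts is a separate step that is not shown here.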