BMC Bioinformatics. 2014;15 Suppl 9(Suppl 9):S13. doi: 10.1186/1471-2105-15-S9-S13. Epub 2014 Sep 10.
Acquiring genomes at single-cell resolution has many applications, such as the study of microbiota. However, deep sequencing and assembly of all the millions of cells in a sample is prohibitively costly. A property that can come to the rescue is that deep sequencing of every cell should not be necessary to capture all distinct genomes, because the majority of cells are biological replicates. Biologically important samples are often sparse in that sense. In this paper, we propose an adaptive compressed method, also known as distilled sensing, to capture all distinct genomes in a sparse microbial community with reduced sequencing effort. In contrast to group testing, in which the number of distinct events is often constant and sparsity is equivalent to the rarity of an event, sparsity in our case means that distinct events are scarce relative to the data size. Previously, we introduced the problem and proposed a distilled sensing solution based on a breadth-first search strategy. We simulated the whole process, and its computational intensity constrained our ability to study the behavior of the algorithm over the entire ensemble.
In this paper, we modify our previous breadth-first search strategy and introduce a depth-first search strategy. Instead of simulating the entire process, which is intractable for a large number of experiments, we provide a dynamic programming algorithm to analyze the behavior of the method over the entire ensemble. The ensemble analysis algorithm recursively calculates the probability of capturing every distinct genome and the expected total number of sequenced nucleotides for a given population profile. Our results suggest that the expected total number of sequenced nucleotides grows in proportion to the logarithm of the number of cells and linearly with the number of distinct genomes. The probability of missing a genome depends on its abundance and on the ratio of its size to the maximum genome size in the sample. The modified resource allocation method includes a parameter to control that probability.
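To give a feel for why the probability of missing a genome depends on its abundance, the following minimal sketch uses a deliberately simplified model that is not the ensemble analysis algorithm described above: cells are assumed to be drawn independently and uniformly at random, and genome size is ignored. Under that assumption, a genome with relative abundance p is absent from a sample of m cells with probability (1 - p)^m. The function names `miss_probability` and `cells_needed` are hypothetical and chosen only for this illustration.

```python
import math


def miss_probability(abundance: float, cells_sampled: int) -> float:
    """Probability that a genome is absent from a random sample of cells.

    Assumes independent, uniform draws (sampling with replacement) and
    ignores genome size; an illustrative model, not the paper's method.
    """
    if not 0.0 < abundance <= 1.0:
        raise ValueError("abundance must be in (0, 1]")
    if cells_sampled < 0:
        raise ValueError("cells_sampled must be non-negative")
    return (1.0 - abundance) ** cells_sampled


def cells_needed(abundance: float, max_miss: float) -> int:
    """Smallest sample size whose miss probability is at most max_miss."""
    if not 0.0 < abundance <= 1.0:
        raise ValueError("abundance must be in (0, 1]")
    if not 0.0 < max_miss < 1.0:
        raise ValueError("max_miss must be in (0, 1)")
    if abundance == 1.0:
        return 1  # every cell carries the genome, so one cell suffices
    # Solve (1 - abundance)^m <= max_miss for the smallest integer m.
    return math.ceil(math.log(max_miss) / math.log(1.0 - abundance))


if __name__ == "__main__":
    # Rarer genomes need more sampled cells for the same miss tolerance.
    for p in (0.5, 0.1, 0.01):
        m = cells_needed(p, max_miss=0.05)
        print(f"abundance={p}: sample {m} cells, "
              f"miss probability {miss_probability(p, m):.4f}")
```

In this toy model the tolerance `max_miss` plays a role loosely analogous to the parameter that controls the miss probability, although the actual resource allocation in the method also accounts for genome size.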
The squeezambler 2.0 C++ source code is available at http://sourceforge.net/projects/hyda/.