Cleary Brian, Brito Ilana Lauren, Huang Katherine, Gevers Dirk, Shea Terrance, Young Sarah, Alm Eric J
Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA.
Nat Biotechnol. 2015 Oct;33(10):1053-60. doi: 10.1038/nbt.3329. Epub 2015 Sep 14.
Analyses of metagenomic datasets that are sequenced to a depth of billions or trillions of bases can uncover hundreds of microbial genomes, but naive assembly of these data is computationally intensive, requiring hundreds of gigabytes to terabytes of RAM. We present latent strain analysis (LSA), a scalable, de novo pre-assembly method that separates reads into biologically informed partitions and thereby enables assembly of individual genomes. LSA is implemented with a streaming calculation of unobserved variables that we call eigengenomes. Eigengenomes reflect covariance in the abundance of short, fixed-length sequences, or k-mers. As the abundance of each genome in a sample is reflected in the abundance of each k-mer in that genome, eigengenome analysis can be used to partition reads from different genomes. This partitioning can be done in fixed memory using tens of gigabytes of RAM, which makes assembly and downstream analyses of terabytes of data feasible on commodity hardware. Using LSA, we assemble partial and near-complete genomes of bacterial taxa present at relative abundances as low as 0.00001%. We also show that LSA is sensitive enough to separate reads from several strains of the same species.
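The partitioning idea in the abstract can be illustrated with a small in-memory sketch. It is not the authors' implementation: the function names, the toy genomes, the row normalisation, and the greedy cosine-similarity clustering with a 0.9 threshold are all assumptions made for illustration. It also uses a plain NumPy SVD rather than the streaming, fixed-memory calculation the abstract describes. What it shows is only the core intuition: k-mers from the same genome share an abundance profile across samples, so decomposing the k-mer-by-sample matrix lets reads be grouped by genome.

```python
from collections import Counter

import numpy as np


def kmers(read, k):
    """Return all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def abundance_matrix(samples, k):
    """Build a (k-mers x samples) count matrix from lists of reads."""
    counts = [Counter(km for read in reads for km in kmers(read, k))
              for reads in samples]
    vocab = sorted(set().union(*counts))
    matrix = np.array([[c[km] for c in counts] for km in vocab], dtype=float)
    return vocab, matrix


def latent_profiles(matrix, n_components):
    """Project unit-length k-mer profiles onto the top right singular
    vectors, which stand in here for the 'eigengenomes' (an assumption)."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(unit, full_matrices=False)
    latent = unit @ vt[:n_components].T
    norms = np.maximum(np.linalg.norm(latent, axis=1, keepdims=True), 1e-12)
    return latent / norms


def cluster_kmers(latent, threshold=0.9):
    """Greedy clustering: join the most similar seed if the cosine
    similarity reaches the threshold, otherwise start a new cluster."""
    seeds, labels = [], []
    for vec in latent:
        sims = [float(vec @ s) for s in seeds]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


def partition_read(read, k, kmer_label):
    """Assign a read to the cluster that holds most of its k-mers."""
    votes = Counter(kmer_label[km] for km in kmers(read, k) if km in kmer_label)
    return votes.most_common(1)[0][0] if votes else None


# Toy data (hypothetical): two "genomes" with disjoint k-mers, mixed at
# different abundances in three samples.
genome_a = "AACACCCAACCA"
genome_b = "GGTGTTTGGTTG"
samples = [
    [genome_a] * 5 + [genome_b] * 1,
    [genome_a] * 1 + [genome_b] * 4,
    [genome_a] * 3 + [genome_b] * 3,
]
k = 4

vocab, matrix = abundance_matrix(samples, k)
latent = latent_profiles(matrix, n_components=2)
kmer_label = dict(zip(vocab, cluster_kmers(latent)))
partition_a = partition_read(genome_a, k, kmer_label)
partition_b = partition_read(genome_b, k, kmer_label)
print(partition_a, partition_b)
```

In this toy case the two genomes land in different partitions because their abundance profiles across the three samples point in clearly different directions. Real data would need many more samples, a memory-bounded decomposition, and a more careful clustering step.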