Tri-Institutional Computational Biology and Medicine Program, Weill Cornell Medicine of Cornell University, New York, New York 10065, USA.
Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, New York, New York 10065, USA.
Genome Res. 2019 Jan;29(1):116-124. doi: 10.1101/gr.235499.118. Epub 2018 Dec 6.
Emerging Linked-Read technologies (aka read cloud or barcoded short-reads) have revived interest in short-read technology as a viable approach to understand large-scale structures in genomes and metagenomes. Linked-Read technologies, such as the 10x Chromium system, use a microfluidic system and a specialized set of 3' barcodes (aka UIDs) to tag short DNA reads sourced from the same long fragment of DNA; subsequently, the tagged reads are sequenced on standard short-read platforms. This approach results in interesting compromises. Each long fragment of DNA is only sparsely covered by reads, no information about the ordering of reads from the same fragment is preserved, and 3' barcodes match reads from roughly 2-20 long fragments of DNA. However, compared to long-read technologies, the cost per base to sequence is far lower, far less input DNA is required, and the per base error rate is that of Illumina short-reads. In this paper, we formally describe a particular algorithmic issue common to Linked-Read technology: the deconvolution of reads with a single 3' barcode into clusters that represent single long fragments of DNA. We introduce Minerva, a graph-based algorithm that approximately solves the barcode deconvolution problem for metagenomic data (where reference genomes may be incomplete or unavailable). Additionally, we develop two demonstrations where the deconvolution of barcoded reads improves downstream results, improving the specificity of taxonomic assignments and of k-mer-based clustering. To the best of our knowledge, we are the first to address the problem of barcode deconvolution in metagenomics.
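To make the barcode deconvolution problem concrete, the following is a minimal sketch of one graph-based way to split the reads of a single barcode into clusters. It is an illustration only and is not Minerva's implementation. It assumes each read has already been annotated with the set of other barcodes whose reads share sequence with it; that input, the function name `deconvolve_barcode`, the `min_shared` threshold, and the barcode labels are all hypothetical. Two reads are linked when they co-occur in enough other barcodes, and each connected component is taken as one putative long fragment.

```python
from itertools import combinations


def deconvolve_barcode(read_to_barcodes, min_shared=2):
    """Split the reads of one barcode into putative fragment clusters.

    read_to_barcodes: dict mapping read id -> set of OTHER barcodes whose
        reads share sequence with this read (hypothetical, precomputed input).
    min_shared: minimum number of shared barcodes needed to link two reads
        (an assumed threshold, not a value taken from the paper).
    """
    # Union-find forest over the reads of this barcode.
    parent = {read: read for read in read_to_barcodes}

    def find(read):
        # Walk to the root, halving the path along the way.
        while parent[read] != read:
            parent[read] = parent[parent[read]]
            read = parent[read]
        return read

    # Link two reads when they co-occur in enough other barcodes.
    for a, b in combinations(read_to_barcodes, 2):
        shared = read_to_barcodes[a] & read_to_barcodes[b]
        if len(shared) >= min_shared:
            parent[find(a)] = find(b)

    # Each connected component is one putative long fragment.
    clusters = {}
    for read in read_to_barcodes:
        clusters.setdefault(find(read), set()).add(read)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Toy input: six reads carrying the same barcode (labels are made up).
example = {
    "r1": {"BC07", "BC12", "BC31"},
    "r2": {"BC07", "BC12"},
    "r3": {"BC12", "BC31", "BC44"},
    "r4": {"BC50", "BC61"},
    "r5": {"BC50", "BC61", "BC70"},
    "r6": {"BC99"},
}
clusters = deconvolve_barcode(example)
print(clusters)
```

In this toy input, r1 is linked to both r2 and r3, so the three reads form one cluster even though r2 and r3 share only a single barcode; r4 and r5 form a second cluster, and r6 stays alone. Connected components are the simplest possible clustering choice here; a denser or noisier graph would likely need a stricter threshold or a community-detection step.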