Division of Biology and Biological Engineering, California Institute of Technology, 116 Kerckhoff Laboratory, Pasadena, CA, 91125, USA.
Departments of Biology and Computing & Mathematical Sciences, California Institute of Technology, 116 Kerckhoff Laboratory, Pasadena, CA, 91125, USA.
BMC Bioinformatics. 2019 Jan 17;20(1):32. doi: 10.1186/s12859-019-2612-0.
Single-cell sequencing experiments use short DNA barcode 'tags' to identify reads that originate from the same cell. In order to recover single-cell information from such experiments, reads must be grouped based on their barcode tag, a crucial processing step that precedes other computations. However, this step can be difficult due to high rates of mismatch and deletion errors that can afflict barcodes.
Here we present an approach to identify and error-correct barcodes by traversing the de Bruijn graph of circularized barcode k-mers. Our approach is based on the observation that circularizing a barcode sequence can yield error-free k-mers even when k is large relative to the length of the barcode sequence, a regime that is typical of single-cell barcoding applications. This allows reads to be assigned to consensus fingerprints constructed from k-mers.
We show that, for single-cell RNA-Seq, circularization improves the recovery of accurate single-cell transcriptome estimates, especially when there is a high number of errors per read. This approach is robust to the type of error (mismatch, insertion, deletion), as well as to the relative abundances of the cells. Sircel, a software package that implements this approach, is described and publicly available.
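The observation behind circularization can be illustrated with a minimal Python sketch. It is not taken from Sircel. The function names, the 12-nt barcode, and the choice of k = 8 are assumptions made only for illustration. The sketch counts how many k-mer windows of an error-containing barcode still match the true barcode, with and without circularization, for a single mismatch.

```python
def linear_kmers(seq, k):
    """All k-mers of seq read left to right, without wrapping around."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def circular_kmers(seq, k):
    """All k-mers of seq treated as a circle: one k-mer per start position."""
    if not 0 < k <= len(seq):
        raise ValueError("k must be between 1 and len(seq)")
    # Appending the first k-1 bases lets windows wrap past the end.
    extended = seq + seq[:k - 1]
    return [extended[i:i + k] for i in range(len(seq))]


def count_matching_windows(true_kmers, read_kmers):
    """Count positions where both k-mer lists agree.

    Position-wise comparison is only meaningful for mismatch errors,
    where the barcode length (and so the window alignment) is unchanged.
    """
    return sum(a == b for a, b in zip(true_kmers, read_kmers))


# Hypothetical 12-nt barcode and a copy with one mismatch at index 5 (G -> A).
true_barcode = "ACGTTGCAAGCT"
read_barcode = "ACGTTACAAGCT"
k = 8

linear_ok = count_matching_windows(
    linear_kmers(true_barcode, k), linear_kmers(read_barcode, k)
)
circular_ok = count_matching_windows(
    circular_kmers(true_barcode, k), circular_kmers(read_barcode, k)
)

print(linear_ok)    # 0: every linear window covers the mismatch
print(circular_ok)  # 4: the windows starting at indices 6 to 9 avoid it
```

With k = 8 on a 12-nt barcode there are only five linear windows, and all of them cover index 5, so no error-free k-mer survives. Circularization yields twelve windows, four of which wrap around the end of the barcode and miss the mismatch. Handling insertions and deletions requires comparing k-mer sets or traversing the graph rather than aligning windows by position, which this sketch does not attempt.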