Department of Computer Science, Princeton University, Princeton, NJ, USA.
Department of Computer Science, Brown University, Providence, RI, USA.
Bioinformatics. 2018 Jul 1;34(13):i211-i217. doi: 10.1093/bioinformatics/bty286.
Current technologies for single-cell DNA sequencing require whole-genome amplification (WGA), as a single cell contains too little DNA for direct sequencing. Unfortunately, WGA introduces biases in the resulting sequencing data, including non-uniformity in genome coverage and high rates of allele dropout. These biases complicate many downstream analyses, including the detection of genomic variants.
We show that amplification biases have a potential upside: long-range correlations in rates of allele dropout provide a signal for phasing haplotypes at the lengths of amplicons from WGA, which are generally longer than individual sequence reads. We describe a statistical test to measure concurrent allele dropout between single-nucleotide polymorphisms (SNPs) across multiple sequenced single cells. We use the results of this test to perform haplotype assembly across a collection of single cells. We demonstrate that the algorithm predicts phasing between pairs of SNPs with higher accuracy than phasing from reads alone. Using whole-genome sequencing data from only seven neural cells, we obtain haplotype blocks that are orders of magnitude longer than with sequence reads alone (median length 10.2 kb versus 312 bp), with error rates <2%. We demonstrate similar advantages on whole-exome data from 16 cells, where we obtain haplotype blocks with a median length of 9.2 kb (comparable to typical gene lengths), compared with a median length of 41 bp with sequence reads alone, with error rates <4%. Our algorithm will be useful for haplotyping of rare alleles and studies of allele-specific somatic aberrations.
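The idea of testing for concurrent allele dropout between a pair of SNPs can be illustrated with a minimal sketch. This is not the authors' statistical test, whose details are not given in this abstract; it assumes a simple exact binomial test with a null of no linkage between dropout events. The function names, the `"ref"`/`"alt"`/`"both"` observation labels, and the `alpha` threshold are all hypothetical and chosen only for illustration.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial p-value for k successes in n trials at p = 0.5."""
    if n == 0:
        return 1.0
    tail = min(k, n - k)
    # The null distribution is symmetric at p = 0.5, so double the smaller tail
    # and cap the result at 1.
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)


def phase_pair(cells, alpha=0.05):
    """Call the phase of two heterozygous SNPs from per-cell allele observations.

    cells: list of (obs_snp1, obs_snp2) tuples, one per cell. Each observation
    is "ref" or "alt" (only that allele was seen, i.e. the other dropped out),
    "both" (no dropout), or None (no coverage). These labels are assumptions.
    """
    cis = trans = 0
    for a, b in cells:
        # Only cells in which both SNPs lost exactly one allele are informative.
        if a in ("ref", "alt") and b in ("ref", "alt"):
            if a == b:
                cis += 1    # surviving alleles agree: ref-ref or alt-alt
            else:
                trans += 1  # surviving alleles disagree: ref-alt or alt-ref
    p = binomial_two_sided_p(cis, cis + trans)
    if p >= alpha or cis == trans:
        call = "unphased"
    elif cis > trans:
        call = "cis"
    else:
        call = "trans"
    return call, p, cis, trans


if __name__ == "__main__":
    # Seven hypothetical cells, all showing concordant dropout at both SNPs.
    observations = [("ref", "ref")] * 4 + [("alt", "alt")] * 3
    print(phase_pair(observations))
```

Under these assumptions, seven informative cells with fully concordant dropout give a two-sided p-value of 2/128, about 0.016, which suggests how a handful of cells could carry a usable phasing signal. A real implementation would also need to account for sequencing errors, uneven coverage, and multiple testing across SNP pairs.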
Source code is available at https://www.github.com/raphael-group.
Supplementary data are available at Bioinformatics online.