Reproducibility of Illumina platform deep sequencing errors allows accurate determination of DNA barcodes in cells.
Author information
Beltman Joost B, Urbanus Jos, Velds Arno, van Rooij Nienke, Rohr Jan C, Naik Shalin H, Schumacher Ton N
Affiliations
Division of Immunology, The Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands.
Division of Toxicology, Leiden Academic Centre for Drug Research, Leiden University, 2333 CC, Leiden, The Netherlands.
Publication information
BMC Bioinformatics. 2016 Apr 2;17:151. doi: 10.1186/s12859-016-0999-4.
BACKGROUND
Next generation sequencing (NGS) of amplified DNA is a powerful tool for describing genetic heterogeneity within cell populations; it can be used both to investigate the clonal structure of cell populations and to perform genetic lineage tracing. For applications in which both abundant and rare sequences are biologically relevant, the relatively high error rate of NGS techniques complicates data analysis, because it is difficult to distinguish rare true sequences from spurious sequences generated by PCR or sequencing errors. This issue applies, for instance, to cellular barcoding strategies that aim to track the number and type of offspring of single cells by supplying those cells with unique heritable DNA tags.
RESULTS
Here, we use genetic barcoding data from the Illumina HiSeq platform to show that straightforward read threshold-based filtering of data is typically insufficient to filter out spurious barcodes. Importantly, we demonstrate that specific sequencing errors occur at an approximately constant rate across different samples that are sequenced in parallel. We exploit this observation by developing a novel approach to filter out spurious sequences.
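The observation above can be illustrated with a minimal sketch. It is not the authors' published method, which the abstract does not specify. It assumes a hypothetical input layout (a dict of per-sample read counts per barcode) and hypothetical cut-offs `max_ratio` and `max_cv`. The idea: a barcode one mismatch away from a much more abundant "parent" barcode is flagged as likely spurious when its read count stays at a roughly constant fraction of the parent's count across samples sequenced in parallel.

```python
from statistics import mean, pstdev


def hamming(a, b):
    """Number of mismatching positions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))


def flag_spurious(counts, max_ratio=0.05, max_cv=0.2):
    """Flag barcodes whose child/parent read ratio is small and near-constant.

    counts: {sample_name: {barcode: read_count}} (hypothetical layout).
    max_ratio, max_cv: illustrative cut-offs, not values from the paper.
    """
    samples = list(counts)
    barcodes = {bc for s in samples for bc in counts[s]}
    flagged = set()
    for child in barcodes:
        for parent in barcodes:
            # Only consider parents exactly one mismatch away.
            if len(parent) != len(child) or hamming(parent, child) != 1:
                continue
            # Child/parent ratio in every sample where the parent was seen.
            ratios = [counts[s].get(child, 0) / counts[s][parent]
                      for s in samples if counts[s].get(parent, 0) > 0]
            if len(ratios) < 2:
                continue
            avg = mean(ratios)
            if avg == 0 or avg > max_ratio:
                continue
            # A low coefficient of variation means a near-constant error rate.
            if pstdev(ratios) / avg <= max_cv:
                flagged.add(child)
                break
    return flagged


# Toy data: "ACGA" tracks "ACGT" at about 1% in every sample,
# while "ACGC" varies independently, as a rare true barcode might.
example_counts = {
    "sample_A": {"ACGT": 10000, "ACGA": 100, "ACGC": 100},
    "sample_B": {"ACGT": 2000, "ACGA": 21, "ACGC": 5},
    "sample_C": {"ACGT": 5000, "ACGA": 49, "ACGC": 300},
}
likely_spurious = flag_spurious(example_counts)
```

On this toy data only "ACGA" is flagged. A fixed read threshold would behave differently: a cut-off of 10 reads in sample_B would keep the error-like "ACGA" (21 reads) while discarding "ACGC" (5 reads), which is the kind of failure the paragraph above describes.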
CONCLUSIONS
Application of our new method demonstrates its value in the identification of true sequences amongst spurious sequences in biological data sets.