Reproducibility of Illumina platform deep sequencing errors allows accurate determination of DNA barcodes in cells.

Authors

Beltman Joost B, Urbanus Jos, Velds Arno, van Rooij Nienke, Rohr Jan C, Naik Shalin H, Schumacher Ton N

Affiliations

Division of Immunology, The Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands.

Division of Toxicology, Leiden Academic Centre for Drug Research, Leiden University, 2333 CC, Leiden, The Netherlands.

Publication information

BMC Bioinformatics. 2016 Apr 2;17:151. doi: 10.1186/s12859-016-0999-4.

Abstract

BACKGROUND

Next-generation sequencing (NGS) of amplified DNA is a powerful tool to describe genetic heterogeneity within cell populations. It can be used both to investigate the clonal structure of cell populations and to perform genetic lineage tracing. For applications in which both abundant and rare sequences are biologically relevant, the relatively high error rate of NGS techniques complicates data analysis, because it is difficult to distinguish rare true sequences from spurious sequences generated by PCR or sequencing errors. This issue applies, for instance, to cellular barcoding strategies, which aim to follow the amount and type of offspring of single cells by supplying these cells with unique heritable DNA tags.

RESULTS

Here, we use genetic barcoding data from the Illumina HiSeq platform to show that straightforward read threshold-based filtering of data is typically insufficient to filter out spurious barcodes. Importantly, we demonstrate that specific sequencing errors occur at an approximately constant rate across different samples that are sequenced in parallel. We exploit this observation by developing a novel approach to filter out spurious sequences.
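The abstract does not describe the filtering procedure itself, so the Python sketch below is only an illustration of the underlying idea, not the authors' method. It assumes that a spurious barcode differs from a more abundant "parent" barcode by a single base, and that its read count is a roughly constant fraction of the parent's count in every sample sequenced in parallel. The function names, the one-mismatch rule, and the `max_cv` cutoff are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def flag_spurious(counts, max_cv=0.25, min_samples=2):
    """Return {child_barcode: parent_barcode} for likely spurious barcodes.

    counts: dict mapping sample name -> dict mapping barcode -> read count.
    A child is flagged when a more abundant parent one mismatch away exists
    and the child/parent read ratio is nearly constant across samples
    (coefficient of variation <= max_cv). Both thresholds are assumptions.
    """
    barcodes = sorted({b for sample in counts.values() for b in sample})
    totals = {b: sum(s.get(b, 0) for s in counts.values()) for b in barcodes}
    spurious = {}
    for child in barcodes:
        for parent in barcodes:
            if totals[parent] <= totals[child]:
                continue  # a parent must be more abundant than its child
            if len(parent) != len(child) or hamming(parent, child) != 1:
                continue  # only consider single-base substitutions
            # Child/parent ratio in every sample that contains the parent.
            ratios = [
                s.get(child, 0) / s[parent]
                for s in counts.values()
                if s.get(parent, 0) > 0
            ]
            if len(ratios) < min_samples:
                continue  # reproducibility cannot be judged from one sample
            avg = mean(ratios)
            if avg > 0 and pstdev(ratios) / avg <= max_cv:
                spurious[child] = parent
                break
    return spurious


# Toy data: AAAT tracks AAAA at about 1% in every sample, AAAC does not.
example = {
    "s1": {"AAAA": 10000, "AAAT": 100, "AAAC": 5, "CCCC": 50},
    "s2": {"AAAA": 20000, "AAAT": 210, "AAAC": 4000},
    "s3": {"AAAA": 5000, "AAAT": 48, "CCCC": 900},
}
print(flag_spurious(example))
```

In this toy data set a plain read threshold would struggle: the barcode `AAAT` has more reads than the genuine rare barcode `AAAC` in two of the three samples, yet only `AAAT` shows the stable ratio to `AAAA` that marks it as a probable error product.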

CONCLUSIONS

Application of our new method demonstrates its value in the identification of true sequences amongst spurious sequences in biological data sets.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0988/4818877/a4cd60b97936/12859_2016_999_Fig1_HTML.jpg
