Parekh Swati, Ziegenhain Christoph, Vieth Beate, Enard Wolfgang, Hellmann Ines
Anthropology & Human Genomics, Department of Biology II, Ludwig-Maximilians University, Großhaderner Str. 2, 82152 Martinsried, Germany.
Sci Rep. 2016 May 9;6:25533. doi: 10.1038/srep25533.
Currently, quantitative RNA-seq methods are pushed to work with increasingly small starting amounts of RNA that require amplification. However, it is unclear how much noise or bias amplification introduces and how this affects the precision and accuracy of RNA quantification. To assess the effects of amplification, reads that originated from the same RNA molecule (PCR duplicates) need to be identified. Computationally, read duplicates are defined by their mapping position, which does not distinguish PCR duplicates from natural duplicates, and hence it is unclear how to treat duplicated reads. Here, we generate and analyse RNA-seq data sets prepared using three different protocols (Smart-Seq, TruSeq and UMI-seq). We find that a large fraction of computationally identified read duplicates are not PCR duplicates and can be explained by sampling and fragmentation bias. Consequently, the computational removal of duplicates improves neither accuracy nor precision and can actually worsen the power and the False Discovery Rate (FDR) for differential gene expression. Even when duplicates are experimentally identified by unique molecular identifiers (UMIs), power and FDR are only mildly improved. However, the pooling of samples, as made possible by the early barcoding of the UMI protocol, leads to an appreciable increase in the power to detect differentially expressed genes.
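The distinction between computationally and experimentally identified duplicates can be illustrated with a minimal sketch. It is not the analysis pipeline used in the study. The toy reads, the tuple layout and the `count_duplicates` helper are all hypothetical, and real tools handle further details such as paired-end mates and sequencing errors in the UMI.

```python
from collections import Counter

# Hypothetical toy reads: (chromosome, start position, strand, UMI).
reads = [
    ("chr1", 100, "+", "AACG"),
    ("chr1", 100, "+", "AACG"),  # same position, same UMI: PCR duplicate
    ("chr1", 100, "+", "TTGC"),  # same position, different UMI: natural duplicate
    ("chr1", 250, "+", "AACG"),
    ("chr2", 40, "-", "GGTA"),
]


def count_duplicates(reads, use_umi):
    """Count reads beyond the first occurrence of each grouping key.

    With use_umi=False the key is the mapping position only
    (chromosome, start, strand); with use_umi=True the UMI is added.
    """
    keys = [read if use_umi else read[:3] for read in reads]
    counts = Counter(keys)
    return sum(n - 1 for n in counts.values())


# Position-only grouping flags both extra reads at chr1:100.
position_dups = count_duplicates(reads, use_umi=False)
# Adding the UMI flags only the read that shares position and UMI.
umi_dups = count_duplicates(reads, use_umi=True)

print(position_dups, umi_dups)
```

In this toy input, position-based grouping marks two reads as duplicates, while UMI-aware grouping marks only one, so one of the position-based duplicates is a natural duplicate that removal would wrongly discard.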