Sayols Sergi, Scherzinger Denise, Klein Holger
Bioinformatics Core Facility, Institute of Molecular Biology, Ackermannweg 4, 55128, Mainz, Germany.
Technische Hochschule Bingen, Berlinstraße 109, Bingen am Rhein, 55411, Germany.
BMC Bioinformatics. 2016 Oct 21;17(1):428. doi: 10.1186/s12859-016-1276-2.
PCR clonal artefacts originating from NGS library preparation can affect both genomic and RNA-Seq applications when protocols are pushed to their limits. In RNA-Seq, however, artefactual reads are not easy to tell apart from the normal read duplication caused by natural over-sequencing of highly expressed genes. Especially when working with little input material or single cells, assessing the fraction of duplicate reads is an important quality control step for NGS data sets. Until now, the available tools have only calculated global duplication rates without taking gene expression levels into account, which leaves them of limited use for RNA-Seq data.
Here we present the tool dupRadar, which provides an easy means to distinguish the fraction of reads originating in natural duplication due to high expression from the fraction induced by artefacts. dupRadar assesses the fraction of duplicate reads per gene as a function of the gene's expression level. In addition to the Bioconductor package dupRadar, we provide shell scripts for easy integration into processing pipelines.
The Bioconductor package dupRadar offers straightforward methods to assess RNA-Seq data sets for quality issues with PCR duplicates. It is aimed at simple integration into standard analysis pipelines as a default QC metric that is especially useful for low-input and single-cell RNA-Seq data sets.
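The idea of relating per-gene duplication to expression level can be illustrated with a minimal Python sketch. This is not dupRadar's implementation or API: the `GeneCounts` record, the `duplication_profile` function, and the toy counts are all hypothetical, and the sketch assumes that per-gene read counts with and without duplicates have already been obtained from a duplicate-marked alignment.

```python
from dataclasses import dataclass


@dataclass
class GeneCounts:
    """Hypothetical per-gene summary taken from a duplicate-marked alignment."""
    gene_id: str
    length_bp: int      # gene length in base pairs
    all_reads: int      # reads assigned to the gene, duplicates included
    unique_reads: int   # reads assigned to the gene after removing duplicates


def duplication_profile(genes):
    """Return (gene_id, reads per kilobase, duplication rate) sorted by expression.

    Genes without any reads are skipped because their duplication rate is undefined.
    """
    profile = []
    for g in genes:
        if g.all_reads == 0:
            continue
        rpk = g.all_reads / (g.length_bp / 1000)               # expression proxy
        dup_rate = (g.all_reads - g.unique_reads) / g.all_reads
        profile.append((g.gene_id, rpk, dup_rate))
    return sorted(profile, key=lambda row: row[1])


# Toy data (made up for illustration only)
genes = [
    GeneCounts("geneA", 2000, 10, 10),
    GeneCounts("geneB", 1000, 200, 150),
    GeneCounts("geneC", 500, 5000, 1000),
    GeneCounts("geneD", 3000, 0, 0),
]

profile = duplication_profile(genes)
for gene_id, rpk, dup_rate in profile:
    print(f"{gene_id}\tRPK={rpk:.1f}\tduplication={dup_rate:.2f}")
```

In a profile like this, high duplication confined to highly expressed genes is consistent with natural over-sequencing, whereas high duplication among weakly expressed genes would point towards PCR artefacts. A single global duplication rate cannot make that distinction.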