Corley Susan M, MacKenzie Karen L, Beverdam Annemiek, Roddam Louise F, Wilkins Marc R
Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, UNSW Australia, Sydney, New South Wales, Australia.
Children's Cancer Institute Australia, Kensington New South Wales, Sydney, Australia.
BMC Genomics. 2017 May 23;18(1):399. doi: 10.1186/s12864-017-3797-0.
RNA-Seq is now widely used as a research tool. Researchers must choose between paired-end (PE) and single-end (SE) sequencing, and between strand-specific and non-stranded (NS) library preparation kits. To date there has been no analysis of how these choices affect the identification of differentially expressed genes (DEGs) between control and treated samples, or downstream functional analysis.
We undertook four mammalian transcriptomics experiments to compare the effect of SE and PE protocols on read mapping, feature counting, identification of DEGs and functional analysis. For three of these experiments we also compared a non-stranded (NS) and a strand-specific approach to mapping the paired-end data. In all four experiments, SE mapping resulted in fewer reads mapped to features and a lower read count per gene. Up to 4.3% of genes in the SE data and up to 12.3% of genes in the NS data had read counts that differed significantly from the PE data. Comparison of DEGs showed false positives (average 5%, using voom) and false negatives (average 5%, using voom) with the SE reads. These rates increased by a further one or two percentage points with the NS data. Gene ontology (GO) functional enrichment of the DEGs arising from the SE or NS approaches revealed striking differences in the top 20 GO terms, with as little as 40% concordance with the PE results. Caution is therefore advised in the interpretation of such results. By comparison, gene set enrichment analysis results were consistent overall.
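The comparisons described above (false positives and false negatives in a DEG list relative to the PE result, and concordance of the top 20 GO terms) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the gene and term labels, and the choice of denominators (false positives as a fraction of the test list, false negatives as a fraction of the reference list) are assumptions made for illustration only.

```python
def compare_deg_lists(reference_degs, test_degs):
    """Compare a test DEG list (e.g. SE) against a reference DEG list (e.g. PE).

    Assumed definitions (not taken from the paper):
      false positive = gene called in the test list but not in the reference
      false negative = gene called in the reference but not in the test list
    """
    reference, test = set(reference_degs), set(test_degs)
    false_pos = test - reference
    false_neg = reference - test
    return {
        "false_positive_rate": len(false_pos) / len(test) if test else 0.0,
        "false_negative_rate": len(false_neg) / len(reference) if reference else 0.0,
    }


def top_n_concordance(reference_terms, test_terms, n=20):
    """Fraction of the reference's top-n terms that also appear in the test top-n.

    Both inputs are assumed to be ordered from most to least significant.
    """
    ref_top = set(reference_terms[:n])
    test_top = set(test_terms[:n])
    return len(ref_top & test_top) / len(ref_top) if ref_top else 0.0


# Toy data with hypothetical gene and term labels.
pe_degs = [f"gene{i}" for i in range(1, 21)]   # gene1 .. gene20
se_degs = [f"gene{i}" for i in range(2, 22)]   # gene2 .. gene21

pe_terms = [f"term_{i}" for i in range(1, 21)]
se_terms = [f"term_{i}" for i in range(1, 9)] + [f"other_{i}" for i in range(1, 13)]

rates = compare_deg_lists(pe_degs, se_degs)
concordance = top_n_concordance(pe_terms, se_terms)
print(rates)        # one gene differs in each direction out of 20
print(concordance)  # 8 of the top 20 terms are shared
```

In this toy case each rate is 1/20 = 5% and the top-20 concordance is 8/20 = 40%, chosen only to mirror the magnitudes quoted in the abstract.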
A strand-specific protocol should be used in library preparation to generate the most reliable and accurate profile of expression. Ideally, PE reads are also recommended, particularly for transcriptome assembly. Although SE reads produce a DEG list with around 5% false positives and false negatives, SE sequencing can substantially reduce cost, and the saving could be used to increase the number of biological replicates, thereby increasing the power of the experiment. Because SE reads, when used with gene set enrichment analysis, can generate accurate biological results, this may be a desirable trade-off.