Informatics Group, Harvard University, Cambridge, MA, 02138, USA.
BMC Bioinformatics. 2020 Apr 19;21(1):149. doi: 10.1186/s12859-020-3484-z.
Typical experimental design advice for expression analyses using RNA-seq generally assumes that single-end reads provide robust gene-level expression estimates in a cost-effective manner, and that the additional benefits obtained from paired-end sequencing are not worth the additional cost. However, in many cases (e.g., with Illumina NextSeq and NovaSeq instruments), shorter paired-end reads and longer single-end reads can be generated for the same cost, and it is not obvious which strategy should be preferred. Using publicly available data, we test whether short paired-end reads can achieve more robust expression estimates and differential expression results than single-end reads of approximately the same total number of sequenced bases.
At both the transcript and gene levels, expression estimates from 2 × 40 paired-end reads are unequivocally more highly correlated with those from 2 × 125 reads than are estimates from 1 × 75 reads; in nearly all cases, those correlations are also greater than for 1 × 125 reads, despite the greater total number of sequenced bases for the latter. Across an array of metrics, differential expression tests based upon 2 × 40 reads consistently outperform those using 1 × 75 reads.
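The comparison of sequenced bases behind these layouts is simple arithmetic, and a minimal sketch may make it concrete. The helper name `bases_per_fragment` and the compact `"2x40"` notation are illustrative assumptions, not part of the study's pipeline.

```python
def bases_per_fragment(layout: str) -> int:
    """Return the bases sequenced per fragment for a layout such as '2x40'.

    The 'NxL' string format (N reads of length L) is a hypothetical
    shorthand used only for this illustration.
    """
    n_reads, read_length = layout.lower().split("x")
    return int(n_reads) * int(read_length)


# Layouts compared in the text, written in the assumed 'NxL' shorthand.
layouts = ["2x40", "1x75", "1x125", "2x125"]
totals = {layout: bases_per_fragment(layout) for layout in layouts}

for layout, total in totals.items():
    print(f"{layout}: {total} bases per fragment")
```

Under these assumptions, 2 × 40 and 1 × 75 yield a similar number of bases per fragment (80 versus 75), whereas 1 × 125 yields more than either (125) and 2 × 125 yields the most (250).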
Researchers seeking a cost-effective approach for gene-level expression analysis should prefer short paired-end reads over a longer single-end strategy. Short paired-end reads will also give reasonably robust expression estimates and differential expression results at the isoform level.