Chhangawala Sagar, Rudy Gabe, Mason Christopher E, Rosenfeld Jeffrey A
The Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, 10021, USA.
Department of Physiology and Biophysics, Weill Cornell Medical College, New York, NY, 10021, USA.
Genome Biol. 2015 Jun 23;16(1):131. doi: 10.1186/s13059-015-0697-y.
The initial next-generation sequencing technologies produced reads of 25 or 36 bp, and only from a single end of the library sequence. Currently, it is possible to reliably produce 300 bp paired-end sequences for RNA expression analysis. As read lengths have steadily increased, it has generally been assumed that longer reads are more informative and that paired-end reads produce better results than single-end reads. To test this assumption, we used paired-end 101 bp reads and trimmed them to simulate different read lengths, and also separated the pairs to produce single-end reads. For each read length and paired status, we evaluated differential expression levels between two standard samples and compared the results to those obtained by qPCR.
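The trimming and pair-splitting step described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, reads are assumed to be trimmed from the 3' end, and the single-end set is assumed to be mate 1 of each pair, neither of which the abstract specifies.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# One FASTQ record: (header, sequence, separator, quality string).
FastqRecord = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Group an iterable of text lines into 4-line FASTQ records."""
    record: List[str] = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield (record[0], record[1], record[2], record[3])
            record = []


def trim_record(record: FastqRecord, length: int) -> FastqRecord:
    """Keep the first `length` bases and the matching quality values.

    Assumption: trimming removes bases from the 3' end of the read.
    """
    header, seq, sep, qual = record
    return (header, seq[:length], sep, qual[:length])


def simulate_read_lengths(
    mate1_lines: Iterable[str],
    mate2_lines: Iterable[str],
    lengths: Iterable[int],
) -> Dict[Tuple[int, str], list]:
    """Build trimmed paired-end and single-end datasets for each length."""
    mate1 = list(parse_fastq(mate1_lines))
    mate2 = list(parse_fastq(mate2_lines))
    if len(mate1) != len(mate2):
        raise ValueError("mate files must contain the same number of reads")

    datasets: Dict[Tuple[int, str], list] = {}
    for length in lengths:
        trimmed1 = [trim_record(r, length) for r in mate1]
        trimmed2 = [trim_record(r, length) for r in mate2]
        # Paired-end dataset: both mates, trimmed to the same length.
        datasets[(length, "paired")] = list(zip(trimmed1, trimmed2))
        # Single-end dataset. Assumption: only mate 1 is kept.
        datasets[(length, "single")] = trimmed1
    return datasets


if __name__ == "__main__":
    # Toy 101 bp pair standing in for real sequencing output.
    m1 = ["@read1/1", "A" * 101, "+", "I" * 101]
    m2 = ["@read1/2", "C" * 101, "+", "I" * 101]
    for key, reads in simulate_read_lengths(m1, m2, [25, 50, 100]).items():
        print(key, len(reads))
```

Because every simulated dataset is derived from the same original reads, differences between datasets reflect read length and pairing rather than library or run variation.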
We found that, with the exception of 25 bp reads, read length has little effect on the detection of differential expression. Once single-end reads reach a length of 50 bp, the results do not change substantially for any length up to, and including, 100 bp paired-end. However, splice junction detection improves significantly as the read length increases, with 100 bp paired-end reads showing the best performance. We performed the same analysis on two ENCODE samples and found consistent results, confirming that our conclusions have broad application.
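One plausible geometric contributor to the splice junction result, offered here as an illustration rather than as an explanation given by the authors, is that a longer read has more start positions from which it can span a given junction with enough sequence anchored on each side. The sketch below counts those positions; the function name and the minimum overhang of 8 bp are assumptions chosen for the example, not values from the study.

```python
def junction_spanning_starts(read_length: int, min_overhang: int) -> int:
    """Count read start offsets that span one junction.

    A read spans the junction if at least `min_overhang` bases fall on
    each side of it. For a read of length L and overhang k, the number
    of valid start positions is L - 2k + 1 (or 0 if the read is too short).
    """
    if read_length < 0 or min_overhang < 1:
        raise ValueError("read_length must be >= 0 and min_overhang >= 1")
    return max(0, read_length - 2 * min_overhang + 1)


if __name__ == "__main__":
    # Assumed overhang of 8 bp; aligners use their own thresholds.
    for length in (25, 50, 100):
        print(length, junction_spanning_starts(length, min_overhang=8))
    # prints:
    # 25 10
    # 50 35
    # 100 85
```

Under this simplified model, the share of start positions that give a usable junction-spanning alignment rises from 10 of 25 for a 25 bp read to 85 of 100 for a 100 bp read. The model ignores mappability, coverage, and the extra information contributed by the second mate of a pair.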
A researcher could save substantial resources by using 50 bp single-end reads for differential expression analysis instead of longer reads. However, splicing detection is unquestionably improved by paired-end and longer reads. Therefore, the read length should be chosen based on the final goal of the study.