González Emmanuel, Joly Simon
Institut de recherche en biologie végétale, Université de Montréal, 4101 Sherbrooke E, Montréal (QC) H1X 2B2, Canada.
BMC Res Notes. 2013 Dec 3;6:503. doi: 10.1186/1756-0500-6-503.
High-throughput RNA sequencing studies are becoming increasingly popular, and differential expression studies represent an important downstream analysis that often follows de novo transcriptome assembly. Although much attention has been given to bioinformatics tools for differential gene expression, little has yet been given to the impact of the sequence data used in these pipelines.
We tested how differential expression (DE) results are affected when the reads used for the DE analysis differ, in length and pairing, from the reads used to assemble the de novo transcriptome. To investigate this, we created artificial datasets from the long paired-end RNA-seq datasets initially used to build the assembly. All datasets were compared via DE analyses; because all samples come from the same sequencing run, any DE of genes or isoforms can be interpreted as a false positive resulting from sequence attributes. Although the false positive rate for differential gene expression does not seem to be strongly affected by sequencing strategy (maximum of 3.5%), it could reach 12.2% or 28.1% for differential isoform expression, depending on the pipeline used. The choice between paired-end and single-end sequencing had a much greater impact on false positives than sequence length.
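The two ideas in this design, deriving shorter reads from the original long reads and counting any DE call as a false positive, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the 0.05 significance threshold, and the choice to measure the rate as the fraction of tested features called DE are all assumptions made for illustration. A single-end dataset could be derived in the same spirit by processing only the forward-mate file of each pair.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record type: (header, sequence, quality string)
FastqRecord = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from well-formed 4-line FASTQ input (an assumption)."""
    it = iter(lines)
    for header in it:
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def trim(record: FastqRecord, length: int) -> FastqRecord:
    """Keep only the first `length` bases, mimicking a shorter read."""
    header, seq, qual = record
    return header, seq[:length], qual[:length]


def false_positive_rate(adjusted_p_values: Iterable[float],
                        alpha: float = 0.05) -> float:
    """Fraction of tested features called DE.

    When all compared datasets derive from the same sequencing run,
    every DE call is treated as a false positive. The threshold `alpha`
    is an assumed value, not one taken from the study.
    """
    tested = list(adjusted_p_values)
    if not tested:
        return 0.0
    return sum(p < alpha for p in tested) / len(tested)
```

In use, the adjusted p-values would come from whichever DE tool is being evaluated, with one value per gene or isoform tested.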
In light of these false positive rates, we recommend using paired-end rather than single-end sequences in differential expression studies, even though the impact is less serious for differential gene expression.