Université de Bordeaux, Inserm U1212, CNRS UMR5320, Institut Européen de Chimie et Biologie (IECB), 33607 Pessac, France.
Genome Res. 2017 Dec;27(12):2120-2128. doi: 10.1101/gr.224626.117. Epub 2017 Oct 31.
Almost 20 years after the completion of the C. elegans genome sequence, gene structure annotation is still an ongoing process, with new evidence for gene variants regularly uncovered by additional in-depth transcriptome studies. While alternative splice forms can allow a single gene to encode several functional isoforms, the question of how much spurious splicing is tolerated is still heavily debated. Here we gathered a compendium of 1682 publicly available RNA-seq data sets to increase the dynamic range of detection of RNA isoforms, and obtained robust measurements of the relative abundance of each splicing event. While most of the splicing reads come from reproducibly detected splicing events, a large fraction of purported junctions is only supported by a very low number of reads. We devised an automated curation method that takes into account the expression level of each gene to discriminate robust splicing events from potential biological noise. We found that rarely used splice sites disproportionately come from highly expressed genes and are significantly less conserved in other nematode genomes than splice sites with a higher usage frequency. Our increased detection power confirmed trans-splicing for at least 84% of protein coding genes. The genes for which trans-splicing was not observed are overwhelmingly low expression genes, suggesting that the mechanism is pervasive but not fully captured by organism-wide RNA-seq. We generated annotated gene models including quantitative exon usage information for the entire genome. This allows users to visualize at a glance the relative expression of each isoform for their gene of interest.
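The abstract does not spell out the curation rule, so the following is only a minimal sketch of what an expression-aware junction filter could look like, not the authors' method. It assumes that each junction's read count is normalized by the most-used junction of the same gene, and that junctions below a cutoff are set aside as potential noise. The function names, the `min_ratio` cutoff of 0.01, and the toy read counts are all hypothetical.

```python
from collections import defaultdict


def relative_junction_usage(junction_reads, gene_of):
    """Express each junction's reads as a fraction of the top junction in its gene.

    junction_reads: dict mapping junction id -> supporting read count
    gene_of:        dict mapping junction id -> gene id
    """
    max_per_gene = defaultdict(int)
    for junction, reads in junction_reads.items():
        gene = gene_of[junction]
        max_per_gene[gene] = max(max_per_gene[gene], reads)

    usage = {}
    for junction, reads in junction_reads.items():
        top = max_per_gene[gene_of[junction]]
        usage[junction] = reads / top if top else 0.0
    return usage


def robust_junctions(junction_reads, gene_of, min_ratio=0.01):
    """Keep junctions whose relative usage reaches min_ratio (hypothetical cutoff)."""
    usage = relative_junction_usage(junction_reads, gene_of)
    return {j for j, ratio in usage.items() if ratio >= min_ratio}


# Toy data (invented): gene A is highly expressed, gene B is lowly expressed.
reads = {"a1": 100_000, "a2": 50, "b1": 40, "b2": 20}
gene_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}

kept = robust_junctions(reads, gene_of)
print(sorted(kept))
```

In this toy case junction `a2` has more supporting reads than either junction of gene B, yet it is the only one filtered out, because it is a tiny fraction of gene A's total splicing. This is the sense in which a filter relative to gene expression differs from a fixed read-count cutoff, and it is consistent with the observation that rarely used splice sites disproportionately come from highly expressed genes.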