Department of Biochemistry, Stanford University, Stanford, CA, USA.
Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
Bioinformatics. 2019 Apr 15;35(8):1263-1268. doi: 10.1093/bioinformatics/bty785.
Identification of splice sites is critical to gene annotation and to determining which sequences control circRNA biogenesis. Full-length RNA transcripts could in principle complete the annotation of introns and exons in genomes without external ontologies, i.e., ab initio. However, whether the genomic positions where splicing occurs can be reconstructed from full-length transcripts, even when sampled in the absence of noise, depends on the genome sequence composition. If they cannot, there are provable limits on the use of RNA-Seq to define splice locations (linear or circular) in the genome.
We provide a formal definition of splice site ambiguity due to the genomic sequence by introducing the equivalent junction, which is the set of local genomic positions that result in the same RNA sequence when joined through RNA splicing. We show that equivalent junctions are prevalent in diverse eukaryotic genomes and occur in 88.64% and 78.64% of annotated human splice sites in linear and circRNA junctions, respectively. The observed fractions of equivalent junctions and the frequencies of many individual motifs are statistically significant when compared against the null distribution, computed either by simulation or in closed form. The frequency of equivalent junctions establishes a fundamental limit on the possibility of ab initio reconstruction of RNA transcripts without appealing to the ontology of "GT-AG" boundaries defining introns. Said differently, completely ab initio reconstruction is impossible for the vast majority of splice sites in annotated circRNAs and linear transcripts.
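The definition above can be illustrated with a minimal sketch for a linear junction. It is an illustration under stated assumptions, not the authors' scripts: the function names, the 0-based coordinate convention (the upstream exon ends just before `donor`, the downstream exon starts at `acceptor`), and the toy sequence are all hypothetical. The idea is that shifting both breakpoints by the same offset leaves the spliced RNA unchanged exactly when the bases crossed at the donor side match the bases crossed at the acceptor side.

```python
def splice(genome: str, donor: int, acceptor: int) -> str:
    """Join genome[:donor] to genome[acceptor:], removing the intron between them."""
    return genome[:donor] + genome[acceptor:]


def equivalent_shifts(genome: str, donor: int, acceptor: int) -> tuple[int, int]:
    """Return (left, right): how far both breakpoints can slide together
    upstream and downstream while producing the same spliced sequence.

    Assumes a linear junction with 0 <= donor <= acceptor <= len(genome).
    """
    # Slide left: the base just before the donor must equal the base
    # just before the acceptor.
    left = 0
    while left < donor and genome[donor - 1 - left] == genome[acceptor - 1 - left]:
        left += 1
    # Slide right: the base at the donor must equal the base at the acceptor.
    right = 0
    while (acceptor + right < len(genome)
           and genome[donor + right] == genome[acceptor + right]):
        right += 1
    return left, right


# Hypothetical toy locus: exon "ACAG", intron "GTAAAG", exon "GTTT".
genome = "ACAGGTAAAGGTTT"
donor, acceptor = 4, 10

left, right = equivalent_shifts(genome, donor, acceptor)
junctions = [(donor + k, acceptor + k) for k in range(-left, right + 1)]
products = {splice(genome, d, a) for d, a in junctions}
print(junctions)
print(products)
```

In this toy locus the breakpoint pair can slide two bases in either direction, so five distinct genomic junctions yield the same RNA; the transcript sequence alone cannot say which one was used unless an external rule such as the GT-AG intron boundary is applied.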
Two Python scripts that generate an equivalent junction sequence for each junction are available at: https://github.com/salzmanlab/Equivalent-Junctions.
Supplementary data are available at Bioinformatics online.