School of Biological Sciences, University of Aberdeen, Zoology Building, Tillydrone Avenue, Aberdeen, AB24 2TZ, UK.
School of Medicine, Medical Sciences and Nutrition, University of Aberdeen, Institute of Medical Sciences, Foresterhill, Aberdeen, AB25 2ZD, UK.
BMC Bioinformatics. 2021 Mar 22;22(1):140. doi: 10.1186/s12859-021-04009-7.
Spliced leader (SL) trans-splicing replaces the 5' end of pre-mRNAs with the spliced leader, an exon derived from a specialised non-coding RNA originating from elsewhere in the genome. This process is essential for resolving polycistronic pre-mRNAs produced by eukaryotic operons into monocistronic transcripts. SL trans-splicing and operons may have independently evolved multiple times throughout Eukarya, yet our understanding of these phenomena is limited to only a few well-characterised organisms, most notably Caenorhabditis elegans and trypanosomes. The primary barrier to systematic discovery and characterisation of SL trans-splicing and operons is the lack of computational tools for exploiting the surge of transcriptomic and genomic resources for a wide range of eukaryotes.
Here we present two novel pipelines that automate the discovery of SLs and the prediction of operons in eukaryotic genomes from RNA-Seq data. SLIDR assembles putative SLs from 5' read tails present after read alignment to a reference genome or transcriptome, which are then verified by interrogating corresponding SL RNA genes for sequence motifs expected in bona fide SL RNA molecules. SLOPPR identifies RNA-Seq reads that contain a given 5' SL sequence, quantifies genome-wide SL trans-splicing events and predicts operons via distinct patterns of SL trans-splicing events across adjacent genes. We tested both pipelines with organisms known to carry out SL trans-splicing and organise their genes into operons, and demonstrate that (1) SLIDR correctly detects expected SLs and often discovers novel SL variants; (2) SLOPPR correctly identifies functionally specialised SLs, correctly predicts known operons and detects plausible novel operons.
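The idea of recovering SL candidates from 5' read tails can be illustrated with a minimal Python sketch. It is not the SLIDR implementation (the pipelines themselves are written in Bash and R). It assumes that the unaligned tail appears as a soft clip ("S") in a SAM-style CIGAR string, and the function names `five_prime_tail` and `tally_tail_suffixes`, the suffix length `k`, and the toy sequences are all hypothetical. Because trans-spliced reads begin at different positions within the SL but all end at the splice acceptor site, their tails share a common 3' end, so the sketch groups tails by their last `k` bases.

```python
import re
from collections import Counter

# Matches one CIGAR operation, e.g. "22S" or "78M".
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def five_prime_tail(seq, cigar, is_reverse=False):
    """Return the soft-clipped bases at the read's own 5' end.

    `seq` and `cigar` are the SEQ and CIGAR fields of a SAM record.
    Hard clips (H) are ignored because their bases are absent from SEQ.
    Returns an empty string if the 5' end is not soft-clipped.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return ""
    if is_reverse:
        # Reverse-strand reads are stored reverse-complemented in SAM,
        # so the read's 5' end sits at the right-hand end of SEQ.
        length, op = ops[-1]
        return reverse_complement(seq[-length:]) if op == "S" else ""
    length, op = ops[0]
    return seq[:length] if op == "S" else ""


def tally_tail_suffixes(tails, k=10):
    """Count tails by their last k bases (hypothetical grouping key).

    Tails shorter than k are skipped. Frequent suffixes are candidate
    3' ends of a spliced leader and would still need verification
    against SL RNA genes, as the abstract describes.
    """
    return Counter(t[-k:] for t in tails if len(t) >= k)


if __name__ == "__main__":
    # Toy records: (SEQ, CIGAR, is_reverse). Sequences are invented.
    records = [
        ("TTGAGACGTACGTAC", "5S10M", False),
        ("GTTTGAGACGTACGT", "7S8M", False),
        ("ACGTACGTACGTACG", "15M", False),
    ]
    tails = [five_prime_tail(s, c, r) for s, c, r in records]
    print(tally_tail_suffixes([t for t in tails if t], k=4).most_common())
```

In a real analysis the SEQ, CIGAR and strand flag would come from an alignment file produced by an aligner that reports soft clips; how the pipelines themselves cluster and filter tails is not specified here.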
SLIDR and SLOPPR are flexible tools that will accelerate research into the evolutionary dynamics of SL trans-splicing and operons throughout Eukarya and improve gene discovery and annotation for a wide range of eukaryotic genomes. Both pipelines are implemented in Bash and R and are built upon readily available software commonly installed on most bioinformatics servers. Biological insight can be gleaned even from sparse, low-coverage datasets, implying that an untapped wealth of information can be retrieved from existing RNA-Seq datasets as well as from novel full-isoform sequencing protocols as they become more widely available.