Department of Computer Science, University of Kentucky, Lexington, KY, 40506, USA.
Department of Veterinary Science, University of Kentucky, Lexington, KY, 40506, USA.
BMC Genomics. 2018 Dec 27;19(1):971. doi: 10.1186/s12864-018-5350-1.
Exon splicing is a regulated cellular process in the transcription of protein-coding genes. Technological advancements and cost reductions in RNA sequencing have made quantitative and qualitative assessments of the transcriptome both possible and widely available. RNA-seq provides unprecedented resolution to identify gene structures and resolve the diversity of splicing variants. However, currently available ab initio aligners are vulnerable to spurious alignments caused by random sequence matches and by discordance between the sample and the reference genome. As a consequence, a significant number of false positive exon junction predictions are introduced, which further confuse downstream analyses of splice variant discovery and abundance estimation.
In this work, we present a deep learning based splice junction sequence classifier, named DeepSplice, which employs convolutional neural networks to classify candidate splice junctions. We show that (I) DeepSplice outperforms state-of-the-art methods for splice site classification when applied to the popular benchmark dataset HS3D, (II) DeepSplice shows high accuracy for splice junction classification with GENCODE annotation, and (III) applying DeepSplice to the putative splice junctions generated by Rail-RNA alignment of 21,504 human RNA-seq samples reduces 43 million candidates to around 3 million highly confident novel splice junctions.
We implemented a model, inferred from the sequences of annotated exon junctions, that classifies splice junctions derived from primary RNA-seq data. The performance of the model was evaluated and compared through comprehensive benchmarking and testing, indicating reliable performance and broad usability for classifying novel splice junctions derived from RNA-seq alignment.
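The general idea of scoring a candidate splice junction from its sequence with a convolutional network can be sketched as follows. This is a minimal illustration only, not DeepSplice's actual architecture or code: the function names (`one_hot`, `conv1d_relu`, `junction_score`, `random_params`), the filter count and width, the concatenation of donor and acceptor flanks, and the example sequences are all assumptions, and the weights are random rather than trained, so the score carries no biological meaning.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (len, 4) matrix; unknown bases (e.g. N) become all-zero rows."""
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            mat[i, j] = 1.0
    return mat


def conv1d_relu(x, filters, bias):
    """Valid 1D convolution along sequence positions, followed by ReLU.

    x: (L, 4), filters: (n_filters, width, 4), bias: (n_filters,)
    returns: (L - width + 1, n_filters)
    """
    width = filters.shape[1]
    # Stack every window of `width` consecutive positions: (L - width + 1, width, 4)
    windows = np.stack([x[i:i + width] for i in range(x.shape[0] - width + 1)])
    out = np.einsum("pwc,fwc->pf", windows, filters) + bias
    return np.maximum(out, 0.0)


def random_params(n_filters=8, width=5, seed=0):
    """Hypothetical untrained parameters; a real model would learn these from annotated junctions."""
    rng = np.random.default_rng(seed)
    return {
        "filters": rng.normal(0.0, 0.1, size=(n_filters, width, 4)),
        "bias": np.zeros(n_filters),
        "w_out": rng.normal(0.0, 0.1, size=n_filters),
        "b_out": 0.0,
    }


def junction_score(donor_flank, acceptor_flank, params):
    """Return a value in (0, 1) for a candidate junction given its two flanking sequences."""
    x = one_hot(donor_flank + acceptor_flank)           # assumed input layout: flanks concatenated
    h = conv1d_relu(x, params["filters"], params["bias"])
    pooled = h.max(axis=0)                              # global max pooling over positions
    logit = pooled @ params["w_out"] + params["b_out"]
    return float(1.0 / (1.0 + np.exp(-logit)))          # logistic output


params = random_params()
# Made-up flanking sequences around a hypothetical donor and acceptor site.
score = junction_score("CAGGTAAGTA", "TTTTCTTTCAGGTC", params)
```

In a trained classifier of this kind, candidates whose score falls below a chosen threshold would be discarded as likely false positive junctions; the threshold and the training procedure are outside the scope of this sketch.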