Regan Kevin, Saghafi Abolfazl, Li Zhijun
Department of Chemistry and Biochemistry, University of the Sciences, Philadelphia, PA, USA.
Department of Mathematics, Physics and Statistics, University of the Sciences, Philadelphia, PA, USA.
Curr Genomics. 2021 Dec 30;22(5):384-390. doi: 10.2174/1389202922666211011143008.
In many multi-exon genes, splice junctions are key to the transition from pre-messenger RNA to mature messenger RNA through alternative splicing. Because a very high percentage of multi-exon genes undergo alternative splicing, identifying splice junctions is an attractive research topic with important implications.
The aim of this paper is to develop a deep learning model capable of identifying splice junctions in RNA sequences using 13,666 unique sequences of primate RNA.
A Long Short-Term Memory (LSTM) neural network model is developed that classifies a given sequence as EI (exon-intron splice), IE (intron-exon splice), or N (no splice). The model is trained on groups of trinucleotides, and its performance is evaluated on separate validation and test data to prevent bias.
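To make the trinucleotide input concrete, the following is a minimal sketch of how a sequence might be split into trinucleotide tokens and mapped to integer indices before being fed to an LSTM. The abstract does not state whether the groups overlap, so the non-overlapping default (`stride=3`), the index scheme, and all names (`TRIMER_INDEX`, `to_trinucleotides`, `encode`) are assumptions made for illustration, not the authors' implementation.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
UNKNOWN = 0  # assumed index for any trinucleotide containing an ambiguous base

# Map each of the 64 possible trinucleotides to an index in 1..64.
TRIMER_INDEX = {
    "".join(p): i + 1 for i, p in enumerate(product(NUCLEOTIDES, repeat=3))
}


def to_trinucleotides(seq, stride=3):
    """Split a sequence into trinucleotide tokens.

    stride=3 gives non-overlapping groups; stride=1 gives overlapping ones.
    Trailing bases that do not fill a complete trinucleotide are dropped.
    """
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA alphabets alike
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, stride)]


def encode(seq, stride=3):
    """Convert a sequence to a list of integer token indices."""
    return [TRIMER_INDEX.get(t, UNKNOWN) for t in to_trinucleotides(seq, stride)]
```

For example, `to_trinucleotides("ACGUAC")` yields `["ACG", "TAC"]`; the resulting index list is the kind of input an embedding layer followed by an LSTM could consume.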
Model performance was measured using accuracy and F-score on the test data. The finalized model achieved an average accuracy of 91.34% and an average F-score of 91.36% over 50 runs.
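As an illustration of the two metrics, here is a small sketch that computes accuracy and an F-score for the three classes. The abstract does not say how the F-score is aggregated across classes, so the macro average used here is an assumption, and the function names are hypothetical.

```python
LABELS = ("EI", "IE", "N")


def accuracy(y_true, y_pred):
    """Fraction of sequences whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores (assumed aggregation)."""
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)
```

Averaging these two values over repeated training runs would give figures of the kind reported above.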
Comparisons show that the model is highly competitive with recent Convolutional Neural Network structures. The proposed LSTM model achieves the highest accuracy and F-score among published alternative LSTM structures.