Singh Noopur, Nath Ravindra, Singh Dev Bukhsh
Dr. A. P. J. Abdul Kalam Technical University, Lucknow, 226021, India.
Department of Computer Science, University Institute Engineering and Technology, Chhatrapati Sahu Ji Maharaj University, Kanpur, 208024, India.
Biochem Biophys Rep. 2022 May 26;30:101285. doi: 10.1016/j.bbrep.2022.101285. eCollection 2022 Jul.
Machine learning methods have played a major role in improving the accuracy of prediction and classification of DNA (deoxyribonucleic acid) and protein sequences. In eukaryotes, however, splice-site identification and prediction is not straightforward because of numerous false positives. To address this problem, we present a bidirectional Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) based deep learning model that identifies and predicts splice sites in order to predict exons from eukaryotic DNA sequences. During splicing of the primary mRNA transcript, the introns (the non-coding regions of the gene) are spliced out and the exons (the coding regions) are joined. The bidirectional LSTM-RNN model uses intron features: candidate introns start with the splice-site donor (GT) and end with the splice-site acceptor (AG), subject to length constraints. The model was improved by increasing the number of training epochs, and it achieved a maximum accuracy of 95.5%. The model is suitable for large sequential data such as a complete genome.
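The intron feature described in the abstract (a span that begins with the donor GT, ends with the acceptor AG, and satisfies a length constraint) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `candidate_introns` and `one_hot`, the default bounds `min_len=60` and `max_len=1000`, and the one-hot encoding are assumptions made for illustration, since the abstract does not state the actual length limits or the input encoding.

```python
from typing import Iterator, List, Tuple


def candidate_introns(
    seq: str, min_len: int = 60, max_len: int = 1000
) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) spans that start with GT and end with AG.

    min_len and max_len are hypothetical length constraints (in bases);
    the abstract does not give the values used by the authors.
    """
    seq = seq.upper()
    # Positions where a donor (GT) or acceptor (AG) dinucleotide begins.
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [j for j in range(len(seq) - 1) if seq[j:j + 2] == "AG"]
    for i in donors:
        for j in acceptors:
            length = j + 2 - i  # span includes both dinucleotides
            if min_len <= length <= max_len:
                yield (i, j + 2)


def one_hot(seq: str) -> List[List[int]]:
    """Encode bases as 4-dimensional vectors (assumed encoding, order A, C, G, T).

    Unknown symbols such as N map to the all-zero vector.
    """
    table = {
        "A": [1, 0, 0, 0],
        "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0],
        "T": [0, 0, 0, 1],
    }
    return [table.get(base, [0, 0, 0, 0]) for base in seq.upper()]
```

Each candidate span, once encoded, is the kind of fixed-alphabet sequence a bidirectional LSTM classifier could score as a true or false intron. Because every GT is paired with every AG, the number of candidates grows quickly on long sequences, which is consistent with the large number of false positives the abstract mentions.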