Splice2Deep: An ensemble of deep convolutional neural networks for improved splice site prediction in genomic DNA.

Authors

Albaradei Somayah, Magana-Mora Arturo, Thafar Maha, Uludag Mahmut, Bajic Vladimir B, Gojobori Takashi, Essack Magbubah, Jankovic Boris R

Affiliations

Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia; Faculty of Computing and Information Technology, King Abdulaziz University, Saudi Arabia.

Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia; Saudi Aramco, EXPEC-ARC, Drilling Technology Team, Dhahran 31311, Saudi Arabia.

Publication information

Gene. 2020 Dec;763S:100035. doi: 10.1016/j.gene.2020.100035. Epub 2020 May 13.

DOI:10.1016/j.gene.2020.100035

PMID:34493371

Abstract

BACKGROUND

Accurate identification of exon/intron boundaries is critical for the correct annotation of genes with multiple exons. Donor and acceptor splice sites (SS) demarcate these boundaries. Accurate computational models for predicting SS are therefore useful for the functional annotation of genes and genomes, and for finding alternative SS associated with different diseases. Although various models have been proposed for the in silico prediction of SS, their accuracy must improve before annotation can be considered reliable. Moreover, models are often derived and tested on the same genome, which provides no evidence that they apply broadly, i.e. to other, poorly studied genomes.

RESULTS

With this in mind, we developed the Splice2Deep models for SS detection. Each model is an ensemble of deep convolutional neural networks. We evaluated the models on their ability to detect SS in Homo sapiens, Oryza sativa japonica, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans. The results demonstrate that the models efficiently detect SS in organisms not considered during training. Compared to state-of-the-art tools, Splice2Deep models significantly reduced the average error rate, by 41.97% for acceptor SS and by 28.51% for donor SS. Moreover, cross-organism validation demonstrates that the models correctly identify conserved genomic elements, enabling annotation of SS in new genomes by choosing the taxonomically closest model.
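
As an illustration of how the percentages above can be read, the following minimal sketch assumes they are relative reductions in error rate with respect to a baseline tool. The abstract does not state the exact formula or how the average was taken, so the function name and the example numbers are hypothetical.

```python
def relative_error_reduction(baseline_error, new_error):
    """Return the percentage by which new_error is lower than baseline_error.

    Assumption: "reduced error rate" means a relative reduction,
    100 * (baseline - new) / baseline. This is not confirmed by the abstract.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical example: a baseline tool with a 10% error rate and a new
# model with a 6% error rate correspond to a relative reduction of about 40%.
reduction = relative_error_reduction(0.10, 0.06)
print(f"relative error reduction: {reduction:.2f}%")
```

Under this reading, a relative reduction says nothing about the absolute error rates themselves, which would have to be taken from the full paper.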

CONCLUSIONS

Our results demonstrate that Splice2Deep both achieves a considerably lower error rate than other state-of-the-art models and accurately recognizes SS in organisms on which it was not trained, enabling annotation of poorly studied or newly sequenced genomes. Splice2Deep models are implemented in Python using the Keras API; the models and the data are available at https://github.com/SomayahAlbaradei/Splice_Deep.git.
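
To make the idea of an ensemble over encoded DNA windows concrete, here is a dependency-free sketch. It is not the Splice2Deep implementation and does not use Keras. The one-hot encoding, the averaging rule, and the two toy "members" are all assumptions chosen for illustration; the abstract does not describe how the ensemble combines its networks.

```python
from statistics import mean

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 floats (A, C, G, T).

    Characters outside ACGT (for example N) become all-zero rows.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def ensemble_probability(window, members):
    """Average the scores of all ensemble members (assumed combination rule)."""
    encoded = one_hot(window)
    return mean(member(encoded) for member in members)


# Hypothetical stand-ins for trained networks: each maps an encoded
# window to a score in [0, 1].
def gt_detector(encoded):
    # High score when the two central positions spell "GT",
    # the canonical donor dinucleotide.
    i = len(encoded) // 2 - 1
    return 0.9 if encoded[i][2] == 1.0 and encoded[i + 1][3] == 1.0 else 0.1


def gc_fraction(encoded):
    # Fraction of positions that are C or G.
    return sum(row[1] + row[2] for row in encoded) / len(encoded)


window = "AAGGTAAG"
score = ensemble_probability(window, [gt_detector, gc_fraction])
print(f"ensemble score for {window}: {score:.4f}")
```

In a real setting each member would be a trained convolutional network, and the window would span many more bases on both sides of the candidate site.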
