Akpokiro Victor, Chowdhury H M A Mohit, Olowofila Samuel, Nusrat Raisa, Oluwadare Oluwatosin
Department of Computer Science, University of Colorado, Colorado Springs, CO 80918, United States.
Comput Struct Biotechnol J. 2023 May 30;21:3210-3223. doi: 10.1016/j.csbj.2023.05.031. eCollection 2023.
The identification of splice sites, the segments of an RNA gene where noncoding and coding sequences are joined in the 5' and 3' directions, is an essential post-transcriptional step in the annotation of functional genes and is required for studying biological function in eukaryotic organisms through protein production and gene expression. Splice site detection tools have been proposed for this purpose; however, their models target specific use cases and are typically inefficient or not transferable between organisms. Here, we present CNNSplice, a set of deep convolutional neural network models for splice site prediction. Using five-fold cross-validation for model selection, we explore several models based on typical machine learning applications and propose five high-performing models that efficiently predict true and false splice sites (SS) in balanced and imbalanced datasets. Our evaluation results indicate that CNNSplice's models outperform existing methods on datasets from five organisms. In addition, our generality test shows that CNNSplice's models can predict and annotate splice sites in new or poorly trained genome datasets, indicating a broad application spectrum. CNNSplice demonstrates improved prediction, interpretability, and generalizability on genomic datasets compared to existing splice site prediction tools. We have developed a web server for the CNNSplice algorithm, which is publicly accessible at http://www.cnnsplice.online.
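To make the five-fold cross-validation setup mentioned in the abstract more concrete, the following is a minimal sketch in plain Python. It is not taken from the CNNSplice code base. The abstract does not state how sequences are encoded, so the one-hot encoding is an assumption (a common input format for convolutional networks on DNA). The names `one_hot` and `stratified_k_fold`, the stratified splitting strategy, and the 1:4 class ratio in the usage example are likewise hypothetical choices for illustration only.

```python
import random

# Hypothetical helpers for illustration; not part of CNNSplice.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T).

    Assumption: one-hot input. Any other symbol (e.g. N) becomes all zeros.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs, spreading each class evenly over k folds.

    Stratification keeps the true/false splice site ratio similar in every
    fold, which matters when the dataset is imbalanced.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val


if __name__ == "__main__":
    # Toy imbalanced label set: 10 true sites, 40 false sites (assumed ratio).
    labels = [1] * 10 + [0] * 40
    for fold, (train, val) in enumerate(stratified_k_fold(labels, k=5)):
        positives = sum(labels[i] for i in val)
        print(f"fold {fold}: train={len(train)} val={len(val)} true_sites_in_val={positives}")
```

In a model selection loop, each candidate architecture would be trained on `train` and scored on `val` for every fold, and the candidates would then be compared on their average validation score.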