SpliceFinder: ab initio prediction of splice sites using convolutional neural network.

Affiliation

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China.

Publication information

BMC Bioinformatics. 2019 Dec 27;20(Suppl 23):652. doi: 10.1186/s12859-019-3306-3.

Abstract

BACKGROUND

Identifying splice sites is a necessary step in analyzing the location and structure of genes. Two dinucleotides, GT and AG, are highly frequent at splice sites, but many other patterns also occur at splice sites and have important biological functions. At the same time, GT and AG occur frequently in sequences that contain no splice site, which makes prediction prone to false positives. Most existing tools select all sequences containing the two dimers and then focus on distinguishing true splice sites from pseudo ones. This approach reduces false positives, but it misses non-canonical splice sites.
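The candidate-selection step described above can be illustrated with a minimal sketch. It is not taken from any of the existing tools; the function name `find_canonical_candidates` and the toy sequence are assumptions made for illustration only. It shows how many GT and AG dinucleotides appear even in a short sequence, and why a site lacking both dimers can never become a candidate under this scheme.

```python
def find_canonical_candidates(seq):
    """Return 0-based start positions of every GT and every AG dinucleotide.

    Hypothetical helper: GT positions are treated as donor candidates and
    AG positions as acceptor candidates, mirroring the dimer-based filtering
    described in the text.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


# Toy sequence (made up, not real genomic data).
toy = "CAGGTAAGTCTGTTAGGTACAGGT"
donors, acceptors = find_canonical_candidates(toy)
print("donor candidates (GT):", donors)
print("acceptor candidates (AG):", acceptors)
```

Even this 24-base toy sequence yields nine candidates, which hints at how many pseudo sites a dimer-based filter must reject on a real genome. A splice site that uses any other dinucleotide is absent from both lists by construction.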

RESULT

We have designed SpliceFinder, a splice site predictor based on a convolutional neural network (CNN). To achieve ab initio prediction, we trained the network on human genomic data. An iterative approach is used to reconstruct the dataset, which tackles the data imbalance problem and forces the model to learn more features of splice sites. The proposed CNN achieves a classification accuracy of 90.25%, which is 10% higher than existing algorithms. It also outperforms existing methods in terms of area under the receiver operating characteristic curve (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can find the exact position of splice sites on long genomic sequences using a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder produces about half as many false positives while keeping recall above 0.8. SpliceFinder also captures non-canonical splice sites. In addition, it performs well on genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining.
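The sliding-window idea can be sketched as follows. This is an illustrative sketch only, not the SpliceFinder implementation: the window length, the threshold, the three-way (donor, acceptor, neither) output, and the names `one_hot`, `toy_scorer`, and `scan` are all assumptions. In particular, `toy_scorer` is a hard-coded placeholder standing in for a trained CNN so that the example runs without any model weights.

```python
BASES = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}


def one_hot(window):
    """Encode a DNA string as rows of 4 values (A, C, G, T); unknown symbols become zeros."""
    return [BASES.get(base, [0, 0, 0, 0]) for base in window.upper()]


def toy_scorer(encoded):
    """Placeholder for a trained classifier (assumption, NOT the paper's CNN).

    Returns (donor, acceptor, neither) scores from the two bases at the
    window centre, just so the scanning loop has something to call.
    """
    mid = len(encoded) // 2
    pair = (encoded[mid], encoded[mid + 1])
    if pair == (BASES["G"], BASES["T"]):
        return (1.0, 0.0, 0.0)
    if pair == (BASES["A"], BASES["G"]):
        return (0.0, 1.0, 0.0)
    return (0.0, 0.0, 1.0)


def scan(seq, scorer, window=8, threshold=0.5):
    """Slide a fixed-length window over seq and report centre positions that score as sites."""
    hits = []
    for start in range(len(seq) - window + 1):
        donor, acceptor, _ = scorer(one_hot(seq[start:start + window]))
        centre = start + window // 2
        if donor >= threshold:
            hits.append((centre, "donor"))
        elif acceptor >= threshold:
            hits.append((centre, "acceptor"))
    return hits


# Toy input (made up): one GT at position 4 and one AG at position 10.
hits = scan("CCCCGTCCCCAGCCCC", toy_scorer)
print(hits)
```

Because `scan` takes the scorer as an argument, the placeholder could be swapped for any model that maps an encoded window to three scores. A learned scorer is what would allow sites without GT or AG to be reported, which the placeholder rule cannot do.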

CONCLUSION

Based on a CNN, we have proposed a new ab initio splice site prediction tool, SpliceFinder, which generates fewer false positives and can detect non-canonical splice sites. Additionally, SpliceFinder is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics.org/wangruohan/SpliceFinder.
