Splam: a deep-learning-based splice site predictor that improves spliced alignments.

Authors

Chao Kuan-Hao, Mao Alan, Salzberg Steven L, Pertea Mihaela

Affiliations

Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA.

Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA.

Publication information

bioRxiv. 2023 Jul 29:2023.07.27.550754. doi: 10.1101/2023.07.27.550754.

DOI:10.1101/2023.07.27.550754

PMID:37546880

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10402160/

Abstract

The process of splicing messenger RNA to remove introns plays a central role in creating genes and gene variants. Here we describe Splam, a novel method for predicting splice junctions in DNA based on deep residual convolutional neural networks. Unlike some previous models, Splam looks at a relatively limited window of 400 base pairs flanking each splice site, motivated by the observation that the biological process of splicing relies primarily on signals within this window. Additionally, Splam introduces the idea of training the network on donor and acceptor pairs together, based on the principle that the splicing machinery recognizes both ends of each intron at once. We compare Splam's accuracy to recent state-of-the-art splice site prediction methods, particularly SpliceAI, another method that uses deep neural networks. Our results show that Splam is consistently more accurate than SpliceAI, with an overall accuracy of 96% at predicting human splice junctions. Splam generalizes even to non-human species, including distant ones such as a flowering plant. Finally, we demonstrate the use of Splam on a novel application: processing the spliced alignments of RNA-seq data to identify and eliminate errors. We show that when used in this manner, Splam yields substantial improvements in the accuracy of downstream transcriptome analysis of both poly(A) and ribo-depleted RNA-seq libraries. Overall, Splam offers a faster and more accurate approach to detecting splice junctions, while also providing a reliable and efficient solution for cleaning up erroneous spliced alignments.
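
The paired-window idea in the abstract can be illustrated with a minimal sketch. It is not Splam's actual code: it assumes the 400-base window is centered on each splice site (200 bases per side), that the donor and acceptor windows are simply concatenated, and that the sequence is one-hot encoded over A, C, G, T. The names site_window and junction_input are hypothetical.

```python
from typing import List

WINDOW = 400          # bases per splice site, as stated in the abstract
HALF = WINDOW // 2    # assumption: the window is centered on the site

# Assumed encoding: one channel per base; anything else (e.g. N) is all zeros.
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
           "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}


def site_window(genome: str, pos: int) -> str:
    """Return WINDOW bases centered on pos, padded with N past either end."""
    if not 0 <= pos <= len(genome):
        raise ValueError("splice site position lies outside the sequence")
    start, end = pos - HALF, pos + HALF
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(genome))
    return left_pad + genome[max(0, start):min(len(genome), end)] + right_pad


def junction_input(genome: str, donor: int, acceptor: int) -> List[List[int]]:
    """One-hot encode the donor window followed by the acceptor window."""
    seq = site_window(genome, donor) + site_window(genome, acceptor)
    return [ONE_HOT.get(base, [0, 0, 0, 0]) for base in seq.upper()]


# Usage on a toy sequence: one junction becomes a 2 * WINDOW by 4 matrix.
toy_genome = "ACGT" * 300
encoded = junction_input(toy_genome, donor=300, acceptor=900)
print(len(encoded), len(encoded[0]))
```

Encoding both ends of the intron in one input is what would let a network score the donor and acceptor jointly rather than one site at a time.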

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9224/10402160/95bbe8d1d583/nihpp-2023.07.27.550754v2-f0001.jpg

Similar articles

Splam: a deep-learning-based splice site predictor that improves spliced alignments.

bioRxiv. 2023 Jul 29:2023.07.27.550754. doi: 10.1101/2023.07.27.550754.

Splam: a deep-learning-based splice site predictor that improves spliced alignments.

Genome Biol. 2024 Sep 16;25(1):243. doi: 10.1186/s13059-024-03379-4.

Discerning novel splice junctions derived from RNA-seq alignment: a deep learning approach.

BMC Genomics. 2018 Dec 27;19(1):971. doi: 10.1186/s12864-018-5350-1.

Splice2Deep: An ensemble of deep convolutional neural networks for improved splice site prediction in genomic DNA.

Gene X. 2020 May 13;5:100035. doi: 10.1016/j.gene.2020.100035. eCollection 2020 Dec.

CI-SpliceAI-Improving machine learning predictions of disease causing splicing variants using curated alternative splice sites.

PLoS One. 2022 Jun 3;17(6):e0269159. doi: 10.1371/journal.pone.0269159. eCollection 2022.

RNA-Seq approach for accurate characterization of splicing efficiency of yeast introns.

Methods. 2020 Apr 1;176:25-33. doi: 10.1016/j.ymeth.2019.03.019. Epub 2019 Mar 26.

Read-Split-Run: an improved bioinformatics pipeline for identification of genome-wide non-canonical spliced regions using RNA-Seq data.

BMC Genomics. 2016 Aug 22;17 Suppl 7(Suppl 7):503. doi: 10.1186/s12864-016-2896-7.

Gene. 2020 Dec;763S:100035. doi: 10.1016/j.gene.2020.100035. Epub 2020 May 13.

A high-resolution single-molecule sequencing-based Arabidopsis transcriptome using novel methods of Iso-seq analysis.

Genome Biol. 2022 Jul 7;23(1):149. doi: 10.1186/s13059-022-02711-0.

EASTR: Identifying and eliminating systematic alignment errors in multi-exon genes.

Nat Commun. 2023 Nov 9;14(1):7223. doi: 10.1038/s41467-023-43017-4.
