An automated framework for evaluation of deep learning models for splice site predictions.

Affiliations

Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey.

Institute of Data Science, Maastricht University, Maastricht, The Netherlands.

Publication information

Sci Rep. 2023 Jun 23;13(1):10221. doi: 10.1038/s41598-023-34795-4.

Abstract

We present a novel framework for the automated evaluation of deep learning-based splice site detectors. The framework eliminates the time-consuming development and experimentation otherwise required across different codebases, architectures, and configurations to obtain the best model for a given RNA splice site dataset. RNA splicing is a cellular process in which pre-mRNAs are processed into mature mRNAs, and it allows multiple mRNA transcripts to be produced from a single gene sequence. With the advancement of sequencing technologies, many splice site variants have been identified and associated with diseases. RNA splice site prediction is therefore essential for gene finding, genome annotation, the detection of disease-causing variants, and the identification of potential biomarkers. Recently, deep learning models have classified genomic signals with high accuracy. Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks and their bidirectional version (BLSTM), and Gated Recurrent Units (GRU) and their bidirectional version (BGRU) are promising models. In genomic data analysis, the locality of CNNs helps because each nucleotide correlates with other bases in its vicinity. In contrast, a BLSTM is trained bidirectionally, so sequential data are processed in both the forward and reverse directions, and it can therefore process 1-D encoded genomic data effectively. Although both methods have been used in the literature, a performance comparison was missing. To compare selected models under similar conditions, we created a blueprint for a series of networks with five different levels. As a case study, we compared the learning capabilities of CNN and BLSTM models as building blocks for RNA splice site prediction on two different datasets. Overall, CNN performed better, with [Formula: see text] accuracy ([Formula: see text] improvement), [Formula: see text] F1 score ([Formula: see text] improvement), and [Formula: see text] AUC-PR ([Formula: see text] improvement) in human splice site prediction. Likewise, CNN achieved superior performance, with [Formula: see text] accuracy ([Formula: see text] improvement), [Formula: see text] F1 score ([Formula: see text] improvement), and [Formula: see text] AUC-PR ([Formula: see text] improvement), in C. elegans splice site prediction. Our results showed that CNN learns faster than BLSTM and BGRU and is better at extracting sequence patterns. To our knowledge, no other framework has been developed explicitly for evaluating splice site detection models and selecting the best model in an automated manner. The proposed framework and blueprint should therefore help in choosing among deep learning models, such as CNN, BLSTM, and BGRU, for splice site analysis and similar classification tasks in other problems.
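As an illustration only, and not the authors' code, the sketch below shows two ingredients the abstract mentions: a 1-D encoding of a nucleotide sequence, and the three reported metrics (accuracy, F1 score, AUC-PR). It assumes one-hot encoding over A, C, G, T, binary labels (1 = splice site), and AUC-PR approximated by average precision with no tied scores. All function names are hypothetical, and the labels and scores in the usage lines are toy values, not results from the paper.

```python
from typing import List, Sequence

NUCLEOTIDES = "ACGT"  # assumed channel order for the encoding


def one_hot_encode(seq: str) -> List[List[int]]:
    """Turn each base into a 4-element indicator row.

    Bases outside A/C/G/T (e.g. N) become an all-zero row.
    """
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in seq.upper()]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Approximate AUC-PR as the mean precision at each rank holding a positive.

    Assumes scores are distinct; ties would make the ranking order-dependent.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    tp = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            precision_sum += tp / rank
    return precision_sum / tp if tp else 0.0


# Toy usage: labels and scores are made up for illustration.
encoded = one_hot_encode("GTAN")
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold
acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
ap = average_precision(y_true, scores)
```

AUC-PR is included because, unlike accuracy, it does not depend on a single decision threshold and stays informative when true splice sites are much rarer than non-sites.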

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6971/10290104/066ba220ad09/41598_2023_34795_Fig1_HTML.jpg
