Deep learning and support vector machines for transcription start site identification.

Authors

Barbero-Aparicio José A, Olivares-Gil Alicia, Díez-Pastor José F, García-Osorio César

Affiliation

Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain.

Publication information

PeerJ Comput Sci. 2023 Apr 17;9:e1340. doi: 10.7717/peerj-cs.1340. eCollection 2023.

DOI:10.7717/peerj-cs.1340

PMID:37346545

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10280436/

Abstract

Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems such as detecting translation initiation sites or promoters, many of the most recent ones based on machine learning. Deep learning methods have proven exceptionally effective for these tasks, but their use in transcription start site identification has not yet been explored in depth. Also, the very few existing works do not compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor do they provide the curated dataset used in the study. The small number of published papers on this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied in related problems with remarkable results, we compared their performance in transcription start site prediction, concluding that SVMs are computationally much slower and that deep learning methods, especially long short-term memory neural networks (LSTMs), are better suited to work with sequences than SVMs. For this purpose, we used the reference human genome GRCh38. Additionally, we studied two different aspects related to data processing: the proper way to generate training examples and the imbalanced nature of the data. Furthermore, the generalization performance of the models studied was also tested using the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices in transcription start site identification, as well as a method to generate transcription start site datasets, including negative instances, for any species available in Ensembl. We found that deep learning methods are better suited than SVMs to solve this problem, being more efficient and better adapted to long sequences and large amounts of data. We also created a transcription start site (TSS) dataset large enough to be used in deep learning experiments.
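The abstract names two data-processing concerns: how training examples are generated and how imbalanced the classes are. The sketch below illustrates the general idea only and is not the authors' pipeline, which works from Ensembl annotations. It assumes the genome is available as a plain string, that TSS coordinates are 0-based indices into it, that negatives are drawn uniformly from non-TSS positions, and that sequences are one-hot encoded. The function names and the parameters `flank` and `neg_per_pos` are hypothetical.

```python
import random

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of four 0/1 values; unknown bases (e.g. N) become all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def extract_window(genome, center, flank):
    """Return the window of length 2*flank+1 centered on `center`, or None if it leaves the sequence."""
    start, end = center - flank, center + flank + 1
    if start < 0 or end > len(genome):
        return None
    return genome[start:end]


def build_dataset(genome, tss_positions, flank, neg_per_pos, seed=0):
    """Build (window, label) pairs: label 1 for windows centered on a TSS, 0 otherwise.

    `neg_per_pos` controls the class imbalance (negatives per positive).
    Assumption: negatives are sampled uniformly from positions that are not annotated TSSs.
    """
    rng = random.Random(seed)
    tss = set(tss_positions)
    examples = []
    # Positive examples: one window per annotated TSS that fits inside the sequence.
    for pos in sorted(tss):
        window = extract_window(genome, pos, flank)
        if window is not None:
            examples.append((window, 1))
    n_pos = len(examples)
    # Negative examples: windows centered on randomly chosen non-TSS positions.
    candidates = [p for p in range(flank, len(genome) - flank) if p not in tss]
    n_neg = min(len(candidates), n_pos * neg_per_pos)
    for pos in rng.sample(candidates, n_neg):
        examples.append((extract_window(genome, pos, flank), 0))
    return examples


if __name__ == "__main__":
    demo_rng = random.Random(42)
    toy_genome = "".join(demo_rng.choice(BASES) for _ in range(500))
    data = build_dataset(toy_genome, [100, 250, 400], flank=20, neg_per_pos=5)
    positives = sum(label for _, label in data)
    print(f"{positives} positives, {len(data) - positives} negatives")
    print("encoded shape:", len(one_hot(data[0][0])), "x", len(BASES))
```

Raising `neg_per_pos` makes the toy dataset more imbalanced, which is one simple way to probe how a classifier responds to class skew. A real dataset would also need to handle strand orientation and multiple chromosomes, which this sketch ignores.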



Similar articles

Comparison of machine learning and deep learning techniques in promoter prediction across diverse species.

PeerJ Comput Sci. 2021 Feb 9;7:e365. doi: 10.7717/peerj-cs.365. eCollection 2021.

DeepTSS: multi-branch convolutional neural network for transcription start site identification from CAGE data.

BMC Bioinformatics. 2022 Dec 12;23(Suppl 2):395. doi: 10.1186/s12859-022-04945-y.

An Experimental Review on Deep Learning Architectures for Time Series Forecasting.

Int J Neural Syst. 2021 Mar;31(3):2130001. doi: 10.1142/S0129065721300011. Epub 2021 Feb 16.

A survey on protein-DNA-binding sites in computational biology.

Brief Funct Genomics. 2022 Sep 16;21(5):357-375. doi: 10.1093/bfgp/elac009.

Optimizing neural networks for medical data sets: A case study on neonatal apnea prediction.

Artif Intell Med. 2019 Jul;98:59-76. doi: 10.1016/j.artmed.2019.07.008. Epub 2019 Jul 25.

Genome annotation across species using deep convolutional neural networks.

PeerJ Comput Sci. 2020 Jun 15;6:e278. doi: 10.7717/peerj-cs.278. eCollection 2020.

A new ensemble residual convolutional neural network for remaining useful life estimation.

Math Biosci Eng. 2019 Jan 28;16(2):862-880. doi: 10.3934/mbe.2019040.

Fast modular network implementation for support vector machines.

IEEE Trans Neural Netw. 2005 Nov;16(6):1651-63. doi: 10.1109/TNN.2005.857952.

A successful hybrid deep learning model aiming at promoter identification.

BMC Bioinformatics. 2022 May 31;23(Suppl 1):206. doi: 10.1186/s12859-022-04735-6.

Cited by

micRoclean: an R package for decontaminating low-biomass 16S-rRNA microbiome data.

Front Bioinform. 2025 May 8;5:1556361. doi: 10.3389/fbinf.2025.1556361. eCollection 2025.

From Sequence to Solution: Intelligent Learning Engine Optimization in Drug Discovery and Protein Analysis.

BioTech (Basel). 2024 Sep 1;13(3):33. doi: 10.3390/biotech13030033.

