Deep learning and support vector machines for transcription start site identification.

Authors

Barbero-Aparicio José A, Olivares-Gil Alicia, Díez-Pastor José F, García-Osorio César

Affiliation

Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain.

Publication information

PeerJ Comput Sci. 2023 Apr 17;9:e1340. doi: 10.7717/peerj-cs.1340. eCollection 2023.

Abstract

Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems such as detecting translation initiation sites or promoters, many of the most recent ones based on machine learning. Deep learning methods have proven exceptionally effective for this task, but their use in transcription start site identification has not yet been explored in depth. Moreover, the very few existing works neither compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor provide the curated dataset used in the study. The small number of papers published on this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied to related problems with remarkable results, we compared their performance in transcription start site prediction, concluding that SVMs are computationally much slower and that deep learning methods, especially long short-term memory neural networks (LSTMs), are better suited than SVMs to working with sequences. For this purpose, we used the human reference genome GRCh38. Additionally, we studied two aspects of data processing: the proper way to generate training examples and the imbalanced nature of the data. The generalization performance of the models was also tested on the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices for transcription start site identification, as well as a method to generate transcription start site datasets, including negative instances, for any species available in Ensembl. We found that deep learning methods are better suited than SVMs to this problem, being more efficient and better adapted to long sequences and large amounts of data. We also created a transcription start site (TSS) dataset large enough to be used in deep learning experiments.
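
The abstract names two data-processing concerns, generating training examples (including negative instances) and class imbalance, without giving details. The following is only a minimal sketch of one way such a dataset could be assembled; it is not the authors' procedure. The window size (`flank`), the exclusion distance around known sites (`min_distance`), the negative-to-positive ratio (`neg_per_pos`), the function names, and the use of inverse-frequency class weights are all illustrative assumptions.

```python
import random

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows over (A, C, G, T); other symbols (e.g. N) become all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def extract_window(genome, center, flank):
    """Return the 2*flank+1 bases centered at `center`, or None if the window leaves the sequence."""
    start, end = center - flank, center + flank + 1
    if start < 0 or end > len(genome):
        return None
    return genome[start:end]


def build_dataset(genome, tss_positions, flank, neg_per_pos, min_distance, seed=0):
    """Build labeled windows: 1 for windows centered on a TSS, 0 for sampled background windows.

    Hypothetical scheme: negatives are drawn uniformly from positions that lie
    at least `min_distance` bases away from every annotated TSS.
    """
    rng = random.Random(seed)
    tss = sorted(set(tss_positions))

    positives = []
    for pos in tss:
        window = extract_window(genome, pos, flank)
        if window is not None:
            positives.append(window)

    # Candidate negative centers: full windows that stay clear of all known TSS positions.
    candidates = [
        c for c in range(flank, len(genome) - flank)
        if all(abs(c - p) >= min_distance for p in tss)
    ]
    n_neg = min(len(candidates), neg_per_pos * len(positives))
    negatives = [extract_window(genome, c, flank) for c in rng.sample(candidates, n_neg)]

    windows = positives + negatives
    labels = [1] * len(positives) + [0] * len(negatives)
    return windows, labels


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = {label: labels.count(label) for label in set(labels)}
    return {label: len(labels) / (len(counts) * n) for label, n in counts.items()}


if __name__ == "__main__":
    toy_genome = "ACGT" * 50  # stand-in for a chromosome sequence
    windows, labels = build_dataset(
        toy_genome, tss_positions=[50, 120], flank=5, neg_per_pos=3, min_distance=10
    )
    print(len(windows), labels, class_weights(labels))
    print(one_hot(windows[0])[:2])
```

In this sketch the imbalance is left in the data and compensated for at training time through the class weights; undersampling the negative class by lowering `neg_per_pos` would be an alternative under the same assumptions.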

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/97ca/10280436/e0b310c4e0ff/peerj-cs-09-1340-g001.jpg
