Comparison of machine learning and deep learning techniques in promoter prediction across diverse species.

Authors

Bhandari Nikita, Khare Satyajeet, Walambe Rahee, Kotecha Ketan

Affiliations

Computer Science, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, MH, India.

Symbiosis School of Biological Sciences, Symbiosis International (Deemed University), Pune, MH, India.

Publication information

PeerJ Comput Sci. 2021 Feb 9;7:e365. doi: 10.7717/peerj-cs.365. eCollection 2021.

DOI:10.7717/peerj-cs.365
PMID:33817015
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7959599/
Abstract

Gene promoters are the key DNA regulatory elements positioned around the transcription start sites and are responsible for regulating the gene transcription process. Various alignment-based, signal-based and content-based approaches are reported for the prediction of promoters. However, since not all promoter sequences show explicit features, the prediction performance of these techniques is poor. Therefore, many machine learning and deep learning models have been proposed for promoter prediction. In this work, we studied methods for vector encoding and promoter classification using genome sequences of three distinct higher eukaryotes viz. yeast (Saccharomyces cerevisiae), A. thaliana (plant) and human (Homo sapiens). We compared one-hot vector encoding method with frequency-based tokenization (FBT) for data pre-processing on 1-D Convolutional Neural Network (CNN) model. We found that FBT gives a shorter input dimension reducing the training time without affecting the sensitivity and specificity of classification. We employed the deep learning techniques, mainly CNN and recurrent neural network with Long Short Term Memory (LSTM) and random forest (RF) classifier for promoter classification at k-mer sizes of 2, 4 and 8. We found CNN to be superior in classification of promoters from non-promoter sequences (binary classification) as well as species-specific classification of promoter sequences (multiclass classification). In summary, the contribution of this work lies in the use of synthetic shuffled negative dataset and frequency-based tokenization for pre-processing. This study provides a comprehensive and generic framework for classification tasks in genomic applications and can be extended to various classification problems.
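
To make the two pre-processing ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes non-overlapping k-mers, a vocabulary in which the most frequent k-mer receives index 1 (0 is reserved for unknown k-mers), and a plain base-level shuffle for building synthetic negatives; the abstract does not specify any of these details. The function names (`one_hot`, `kmers`, `build_vocab`, `fbt_encode`, `shuffled_negative`) are hypothetical.

```python
import random
from collections import Counter

BASES = "ACGT"

def one_hot(seq):
    """Encode each base as a 4-element indicator vector (len(seq) rows)."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

def kmers(seq, k):
    """Split a sequence into non-overlapping k-mers (assumption), dropping a short tail."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def build_vocab(seqs, k):
    """Rank k-mers by corpus frequency; the most frequent k-mer gets index 1."""
    counts = Counter(km for s in seqs for km in kmers(s, k))
    # Ties are broken alphabetically so the vocabulary is deterministic.
    ranked = sorted(counts, key=lambda km: (-counts[km], km))
    return {km: i + 1 for i, km in enumerate(ranked)}

def fbt_encode(seq, vocab, k):
    """Frequency-based tokenization: one integer per k-mer, 0 for unseen k-mers."""
    return [vocab.get(km, 0) for km in kmers(seq, k)]

def shuffled_negative(seq, rng):
    """Build a synthetic negative by shuffling bases (assumed scheme); composition is preserved."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

promoters = ["ACGTACGTAA", "ACGTTTTTAC"]   # toy sequences, not real promoters
vocab = build_vocab(promoters, k=2)
tokens = fbt_encode(promoters[0], vocab, k=2)
matrix = one_hot(promoters[0])
negative = shuffled_negative(promoters[0], random.Random(0))

print(len(matrix) * len(matrix[0]), "one-hot values vs", len(tokens), "FBT tokens")
```

On this toy input the one-hot matrix holds 40 values while the tokenized sequence holds 5 integers, which illustrates why a tokenized input is shorter; larger k shortens it further at the cost of a larger vocabulary.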


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6c43/7959599/078190032c2c/peerj-cs-07-365-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6c43/7959599/1b64bca5eff8/peerj-cs-07-365-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6c43/7959599/a6f97042172d/peerj-cs-07-365-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6c43/7959599/314e9abed7cd/peerj-cs-07-365-g003.jpg
