A Novel Repetition Frequency-Based DNA Encoding Scheme to Predict Human and Mouse DNA Enhancers with Deep Learning.

Author information

Alakuş Talha Burak

Affiliation

Department of Software Engineering, Faculty of Engineering, Kırklareli University, 39100 Kırklareli, Turkey.

Publication information

Biomimetics (Basel). 2023 May 23;8(2):218. doi: 10.3390/biomimetics8020218.

DOI:10.3390/biomimetics8020218

PMID:37366813

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10296748/

Abstract

Recent studies have shown that DNA enhancers play an important role in regulating gene expression. They are involved in important biological processes such as development, homeostasis, and embryogenesis. However, identifying DNA enhancers experimentally is time-consuming and costly because it requires laboratory work. Researchers have therefore turned to computational alternatives, in particular deep learning algorithms. Yet the inconsistent and often poor prediction performance of computational approaches across cell lines has prompted further investigation of these approaches as well. To address these problems, this study proposed a novel DNA encoding scheme and predicted DNA enhancers with a BiLSTM model. The study consisted of four stages, applied to two scenarios. In the first stage, DNA enhancer data were obtained. In the second stage, DNA sequences were converted to numerical representations using both the proposed encoding scheme and existing DNA encoding schemes: EIIP, integer number, and atomic number. In the third stage, the BiLSTM model was designed and the data were classified. In the final stage, the performance of each DNA encoding scheme was measured by accuracy, precision, recall, F1-score, CSI, MCC, G-mean, Kappa coefficient, and AUC. In the first scenario, the task was to determine whether a DNA enhancer belonged to human or mouse. The proposed DNA encoding scheme achieved the highest accuracy, 92.16%, with an AUC of 0.85. The closest accuracy, 89.14%, was obtained with the EIIP scheme, whose AUC was 0.87. Of the remaining schemes, atomic number reached an accuracy of 86.61% and integer 76.96%, with AUC values of 0.84 and 0.82, respectively. In the second scenario, the task was to determine whether a sequence was a DNA enhancer and, if so, which species it belonged to. Here too, the proposed scheme obtained the highest accuracy, 84.59%, with an AUC of 0.92. The EIIP and integer schemes reached accuracies of 77.80% and 73.68%, respectively, with AUC scores close to 0.90. The atomic number scheme was the least effective, with an accuracy of 68.27% and an AUC of 0.81. Overall, the proposed DNA encoding scheme proved successful and effective in predicting DNA enhancers.
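The second stage described in the abstract, mapping nucleotides to numbers, can be illustrated for the three baseline schemes with a minimal Python sketch. The EIIP and atomic number values are those commonly reported in the literature; the integer mapping differs between papers, so the one used here is an assumption, as are the names `SCHEMES` and `encode`. The proposed repetition frequency-based scheme is not specified in the abstract and is therefore not reproduced.

```python
# Per-nucleotide values commonly reported for two baseline encodings.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}
ATOMIC_NUMBER = {"A": 70, "C": 58, "G": 78, "T": 66}

# ASSUMPTION: integer mappings vary between papers; this is one common choice,
# not necessarily the one used in the study.
INTEGER = {"T": 0, "C": 1, "A": 2, "G": 3}

# Hypothetical registry name, introduced only for this illustration.
SCHEMES = {"eiip": EIIP, "atomic": ATOMIC_NUMBER, "integer": INTEGER}


def encode(sequence, scheme):
    """Return the numeric representation of a DNA sequence under one scheme."""
    table = SCHEMES[scheme]
    encoded = []
    for base in sequence.upper():
        if base not in table:
            # Ambiguity codes such as N are not handled in this sketch.
            raise ValueError(f"unsupported nucleotide: {base!r}")
        encoded.append(table[base])
    return encoded


if __name__ == "__main__":
    for name in SCHEMES:
        print(name, encode("ACGTTG", name))
```

Each scheme yields one number per base, so the encoded sequence keeps the original length and can be fed to a sequence model after padding or truncation to a fixed size.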

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b844/10296748/88230dd31baa/biomimetics-08-00218-g001.jpg
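Most of the evaluation metrics listed in the abstract can be computed directly from a confusion matrix. The sketch below does this for the binary case using the standard textbook definitions; the function name `binary_metrics` and the example counts are hypothetical and are not results from the study. CSI is left out because the abstract does not expand the abbreviation, and AUC is left out because it needs ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Standard confusion-matrix metrics for a binary classifier.

    Assumes every denominator is non-zero (both classes are present and
    both are predicted at least once).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity, true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    g_mean = math.sqrt(recall * specificity)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
        "g_mean": g_mean,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    for name, value in binary_metrics(tp=40, fp=10, fn=5, tn=45).items():
        print(f"{name}: {value:.4f}")
```

For the three-class second scenario, these quantities would need to be computed per class and averaged; the abstract does not state which averaging was used.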
