Integrated entropy-based approach for analyzing exons and introns in DNA sequences.

Author affiliations

School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055, China.

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001, China.

Publication information

BMC Bioinformatics. 2019 Jun 10;20(Suppl 8):283. doi: 10.1186/s12859-019-2772-y.

Abstract

BACKGROUND

Over the last decade, numerous algorithms and methods, including entropy-based quantitative methods, have been developed to analyze complex DNA sequences. Exons and introns are the most notable components of DNA, and their identification and prediction remain a focus of state-of-the-art research.

RESULTS

In this study, we designed an integrated entropy-based analysis approach that combines a modified topological entropy calculation, a genomic signal processing (GSP) method, and singular value decomposition (SVD) to investigate exons and introns in DNA sequences. We optimized and implemented the topological entropy and the generalized topological entropy to calculate the complexity of DNA sequences, highlighting the characteristics of repetitive sequences. By comparing the digitized entropy values of exons and introns, we observed that they differ significantly. After converting the DNA data to numerical topological entropy values, we applied SVD to investigate exon and intron regions on a single gene sequence. Additionally, several genes across five species were used for exon prediction.
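To make the entropy step concrete, the following is a minimal sketch of the standard topological entropy of a DNA sequence, not the modified or generalized variants that this study implements. It assumes the usual finite-sequence definition: choose the largest word length n such that 4^n + n - 1 does not exceed the sequence length, take the prefix of that length, count its distinct length-n subwords p(n), and return log_4(p(n)) / n. The function names `topological_entropy` and `entropy_profile`, and the `window` and `step` parameters, are hypothetical and chosen for illustration only.

```python
import math


def topological_entropy(seq: str) -> float:
    """Standard topological entropy of a DNA string, in the range [0, 1].

    Assumption: this is the textbook definition, not the modified
    version described in the abstract.
    """
    seq = seq.upper()
    if any(base not in "ACGT" for base in seq):
        raise ValueError("sequence must contain only A, C, G, T")
    if len(seq) < 4:
        raise ValueError("sequence must have at least 4 bases")

    # Largest n with 4^n + n - 1 <= len(seq).
    n = 1
    while 4 ** (n + 1) + (n + 1) - 1 <= len(seq):
        n += 1

    # The prefix of length 4^n + n - 1 contains exactly 4^n subwords of length n.
    prefix = seq[: 4 ** n + n - 1]
    distinct = {prefix[i : i + n] for i in range(len(prefix) - n + 1)}
    return math.log(len(distinct), 4) / n


def entropy_profile(seq: str, window: int, step: int) -> list[float]:
    """Hypothetical helper: entropy of each sliding window along seq."""
    return [
        topological_entropy(seq[start : start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]
```

A profile produced this way turns a sequence into a numerical signal, which is the kind of input a GSP or SVD step could then operate on. Highly repetitive windows score near 0, and windows that use all possible length-n words score 1.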

CONCLUSIONS

Our approach not only helps to explore the complexity of DNA sequences and their functional elements, but also provides an entropy-based GSP method for analyzing exon and intron regions. Our work is feasible across different species and can be extended to analyze other components in both the coding and noncoding regions of DNA sequences.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6c7e/6557737/1bbf20e30902/12859_2019_2772_Fig1_HTML.jpg
