A fast and efficient algorithm for DNA sequence similarity identification.

Authors

Uddin Machbah, Islam Mohammad Khairul, Hassan Md Rakib, Jahan Farah, Baek Joong Hwan

Affiliations

Department of Computer Science and Engineering, University of Chittagong, Chittagong, 4331 Bangladesh.

Department of Computer Science and Mathematics, Bangladesh Agricultural University, Mymensingh, 2202 Bangladesh.

Publication information

Complex Intell Systems. 2023;9(2):1265-1280. doi: 10.1007/s40747-022-00846-y. Epub 2022 Aug 23.

DOI:10.1007/s40747-022-00846-y
PMID:36035628
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9395857/
Abstract

DNA sequence similarity analysis is needed for many purposes, including genome analysis, extracting biological information, and finding the evolutionary relationships among species. There are two types of sequence analysis: alignment-based (AB) and alignment-free (AF). AB is effective for short homologous sequences but becomes an NP-hard problem for long sequences. AF algorithms can overcome the major limitations of AB, but most existing AF methods show high time complexity and memory consumption, lower precision, and weaker performance on benchmark datasets. To reduce these limitations, we develop an AF algorithm that uses a 2D count matrix inspired by the chaos game representation (CGR) approach. We then shrink the matrix by analyzing neighboring cells and measure similarity using the best combinations of pairwise distance (PD) and phylogenetic tree methods. We also dynamically choose the value of $k$ for the $k$-mers, and we develop an efficient system for finding the positions of $k$-mers in the count matrix. We apply our system to six different datasets. We achieve the top rank on two benchmark datasets from AFproject and 100% accuracy on two datasets (16S ribosomal, 18 Eutherian), and we improve on the time complexity and memory consumption reported by existing studies on their datasets (HEV, HIV-1). The comparative results on the benchmark datasets and against existing studies therefore demonstrate that our method is effective, efficient, and accurate, and that it can be used reliably for DNA sequence similarity measurement.
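The abstract's central data structure, a 2D count matrix in which each k-mer has a fixed cell, can be illustrated with a minimal sketch. The code below is not the authors' implementation: the corner assignment in `BASE_BITS`, the function names, and the normalized Euclidean distance are all assumptions chosen for illustration, and the paper's neighbor-based matrix shrinking and dynamic choice of k are not reproduced here.

```python
import math

# Assumed CGR-style corner assignment: each base contributes one bit to the
# column index and one bit to the row index (hypothetical choice).
BASE_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_position(kmer):
    """Return the (row, col) cell of a k-mer in a 2^k x 2^k grid."""
    row = col = 0
    for base in kmer:
        x_bit, y_bit = BASE_BITS[base]
        col = (col << 1) | x_bit
        row = (row << 1) | y_bit
    return row, col


def count_matrix(seq, k):
    """Count every overlapping k-mer of seq into its grid cell.

    k-mers containing symbols other than A, C, G, T are skipped.
    """
    size = 1 << k
    matrix = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in BASE_BITS for base in kmer):
            row, col = kmer_position(kmer)
            matrix[row][col] += 1
    return matrix


def matrix_distance(m1, m2):
    """Euclidean distance between two count matrices after scaling each
    to relative frequencies, so sequences of different length compare."""
    total1 = sum(map(sum, m1)) or 1
    total2 = sum(map(sum, m2)) or 1
    squared = 0.0
    for row1, row2 in zip(m1, m2):
        for a, b in zip(row1, row2):
            squared += (a / total1 - b / total2) ** 2
    return math.sqrt(squared)


if __name__ == "__main__":
    first = count_matrix("ACGTACGTAC", 2)
    second = count_matrix("ACGTTTGTAC", 2)
    print(matrix_distance(first, second))
```

Because each base maps to a unique bit pair, every distinct k-mer lands in its own cell, so a k-mer's position is computed directly from its letters without a lookup table. A pairwise distance matrix built this way could then be passed to any standard tree-building method.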

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e115/9395857/d9be70f2a638/40747_2022_846_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e115/9395857/13bb189430ed/40747_2022_846_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e115/9395857/29e7176c3e13/40747_2022_846_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e115/9395857/bbf85136d6e4/40747_2022_846_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e115/9395857/9fa2c4ef0dae/40747_2022_846_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e115/9395857/992df2fdee9a/40747_2022_846_Fig6_HTML.jpg

Similar articles

1
A fast and efficient algorithm for DNA sequence similarity identification.
Complex Intell Systems. 2023;9(2):1265-1280. doi: 10.1007/s40747-022-00846-y. Epub 2022 Aug 23.
2
Optimization and Performance Analysis of CAT Method for DNA Sequence Similarity Searching and Alignment.
Genes (Basel). 2024 Mar 7;15(3):341. doi: 10.3390/genes15030341.
3
An improved model for whole genome phylogenetic analysis by Fourier transform.
J Theor Biol. 2015 Oct 7;382:99-110. doi: 10.1016/j.jtbi.2015.06.033. Epub 2015 Jul 4.
4
Applying frequency chaos game representation with perceptual image hashing to gene sequence phylogenetic analyses.
J Mol Graph Model. 2021 Sep;107:107942. doi: 10.1016/j.jmgm.2021.107942. Epub 2021 May 23.
5
On the quality of tree-based protein classification.
Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.
6
A novel alignment-free DNA sequence similarity analysis approach based on top-k n-gram match-up.
J Mol Graph Model. 2020 Nov;100:107693. doi: 10.1016/j.jmgm.2020.107693. Epub 2020 Aug 7.
7
Advanced methods for missing values imputation based on similarity learning.
PeerJ Comput Sci. 2021 Jul 21;7:e619. doi: 10.7717/peerj-cs.619. eCollection 2021.
8
A measure of DNA sequence similarity by Fourier Transform with applications on hierarchical clustering.
J Theor Biol. 2014 Oct 21;359:18-28. doi: 10.1016/j.jtbi.2014.05.043. Epub 2014 Jun 6.
9
ProgSIO-MSA: Progressive-based single iterative optimization framework for multiple sequence alignment using an effective scoring system.
J Bioinform Comput Biol. 2020 Apr;18(2):2050005. doi: 10.1142/S0219720020500055. Epub 2020 May 6.
10
Fast and accurate genome comparison using genome images: The Extended Natural Vector Method.
Mol Phylogenet Evol. 2019 Dec;141:106633. doi: 10.1016/j.ympev.2019.106633. Epub 2019 Sep 26.

Cited by

1
PRCFX-DT: a new graph-based approach for feature selection and classification of genomic sequences.
BMC Bioinformatics. 2025 Jun 17;26(1):159. doi: 10.1186/s12859-025-06183-4.
2
bpRNA-CosMoS: a robust and efficient RNA structural comparison method using k-mer based cosine similarity.
Bioinformatics. 2025 Mar 29;41(4). doi: 10.1093/bioinformatics/btaf108.
3
Mottle: Accurate pairwise substitution distance at high divergence through the exploitation of short-read mappers and gradient descent.
PLoS One. 2024 Mar 21;19(3):e0298834. doi: 10.1371/journal.pone.0298834. eCollection 2024.