• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

CAT 方法在 DNA 序列相似性搜索和比对中的优化与性能分析。

Optimization and Performance Analysis of CAT Method for DNA Sequence Similarity Searching and Alignment.

机构信息

Department of Programming and Computer Technologies, Faculty of Computer Systems and Technologies, Technical University of Sofia, 1756 Sofia, Bulgaria.

出版信息

Genes (Basel). 2024 Mar 7;15(3):341. doi: 10.3390/genes15030341.

DOI:10.3390/genes15030341
PMID:38540400
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC10970343/
Abstract

Bioinformatics is a rapidly developing field enabling scientific experiments via computer models and simulations. In recent years, there has been an extraordinary growth in biological databases. Therefore, it is extremely important to propose effective methods and algorithms for the fast and accurate processing of biological data. Sequence comparisons are the best way to investigate and understand the biological functions and evolutionary relationships between genes on the basis of the alignment of two or more DNA sequences in order to maximize the identity level and degree of similarity. This paper presents a new version of the pairwise DNA sequences alignment algorithm, based on a new method called CAT, where a dependency with a previous match and the closest neighbor are taken into consideration to increase the uniqueness of the CAT profile and to reduce possible collisions, i.e., two or more sequence with the same CAT profiles. This makes the proposed algorithm suitable for finding the exact match of a concrete DNA sequence in a large set of DNA data faster. In order to enable the usage of the profiles as sequence metadata, CAT profiles are generated once prior to data uploading to the database. The proposed algorithm consists of two main stages: CAT profile calculation depending on the chosen benchmark sequences and sequence comparison by using the calculated CAT profiles. Improvements in the generation of the CAT profiles are detailed and described in this paper. Block schemes, pseudo code tables, and figures were updated according to the proposed new version and experimental results. Experiments were carried out using the new version of the CAT method for DNA sequence alignment and different datasets. New experimental results regarding collisions, speed, and efficiency of the suggested new implementation are presented. Experiments related to the performance comparison with Needleman-Wunsch were re-executed with the new version of the algorithm to confirm that we have the same performance. A performance analysis of the proposed algorithm based on the CAT method against the Knuth-Morris-Pratt algorithm, which has a complexity of O(n) and is widely used for biological data searching, was performed. The impact of prior matching dependencies on uniqueness for generated CAT profiles is investigated. The experimental results from sequence alignment demonstrate that the proposed CAT method-based algorithm exhibits minimal deviation, which can be deemed negligible if such deviation is considered permissible in favor of enhanced performance. It should be noted that the performance of the CAT algorithm in terms of execution time remains stable, unaffected by the length of the analyzed sequences. Hence, the primary benefit of the suggested approach lies in its rapid processing capabilities in large-scale sequence alignment, a task that traditional exact algorithms would require significantly more time to perform.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f50a/10970343/d44345972085/genes-15-00341-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f50a/10970343/b8ca2e16ec7b/genes-15-00341-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f50a/10970343/dd6e845a96e3/genes-15-00341-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f50a/10970343/e51886e80f97/genes-15-00341-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f50a/10970343/cbd26da1f06d/genes-15-00341-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f50a/10970343/d44345972085/genes-15-00341-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f50a/10970343/b8ca2e16ec7b/genes-15-00341-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f50a/10970343/dd6e845a96e3/genes-15-00341-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f50a/10970343/e51886e80f97/genes-15-00341-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f50a/10970343/cbd26da1f06d/genes-15-00341-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f50a/10970343/d44345972085/genes-15-00341-g005.jpg
摘要

生物信息学是一个快速发展的领域,通过计算机模型和模拟来进行科学实验。近年来,生物数据库呈指数级增长。因此,提出有效的方法和算法来快速准确地处理生物数据是非常重要的。序列比对是在两个或更多 DNA 序列对齐的基础上,研究和理解基因的生物学功能和进化关系的最佳方法,以最大化身份水平和相似性程度。本文提出了一种新的基于 CAT 的双序列比对算法版本,其中考虑了前一个匹配和最近邻的依赖性,以增加 CAT 谱的独特性并减少可能的冲突,即两个或更多具有相同 CAT 谱的序列。这使得所提出的算法适合在大型 DNA 数据集中更快地找到具体 DNA 序列的精确匹配。为了能够将谱用作序列元数据,在将数据上传到数据库之前,CAT 谱会预先生成。所提出的算法由两个主要阶段组成:根据所选基准序列计算 CAT 谱和使用计算出的 CAT 谱进行序列比较。本文详细描述和说明了 CAT 谱生成方面的改进。根据所提出的新版本和实验结果,更新了框图、伪代码表和图形。使用新的 CAT 方法进行 DNA 序列比对和不同数据集进行了实验。提出了新的实验结果,包括冲突、速度和建议新实现的效率。使用新版本的算法重新执行了与 Needleman-Wunsch 的性能比较实验,以确认我们具有相同的性能。还对基于 CAT 方法的算法与具有 O(n)复杂度且广泛用于生物数据搜索的 Knuth-Morris-Pratt 算法的性能进行了分析。研究了先验匹配依赖性对生成的 CAT 谱的独特性的影响。序列比对的实验结果表明,基于 CAT 方法的算法表现出最小的偏差,如果考虑到允许增强性能,则这种偏差可以忽略不计。需要注意的是,CAT 算法的执行时间性能保持稳定,不受分析序列长度的影响。因此,所提出方法的主要优点在于其在大规模序列比对中的快速处理能力,这是传统精确算法需要花费更多时间才能完成的任务。

相似文献

1
Optimization and Performance Analysis of CAT Method for DNA Sequence Similarity Searching and Alignment.CAT 方法在 DNA 序列相似性搜索和比对中的优化与性能分析。
Genes (Basel). 2024 Mar 7;15(3):341. doi: 10.3390/genes15030341.
2
A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases.一种用于序列数据库比较的快速算法:应用于识别EMBL数据库中的载体污染。
Bioinformatics. 1999 Feb;15(2):111-21. doi: 10.1093/bioinformatics/15.2.111.
3
FOGSAA: Fast Optimal Global Sequence Alignment Algorithm.FOGSAA:快速最优全局序列比对算法。
Sci Rep. 2013;3:1746. doi: 10.1038/srep01746.
4
DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment.DIALIGN-T:一种改进的基于片段的多序列比对算法。
BMC Bioinformatics. 2005 Mar 22;6:66. doi: 10.1186/1471-2105-6-66.
5
CLUSS: clustering of protein sequences based on a new similarity measure.CLUSS:基于一种新的相似性度量对蛋白质序列进行聚类。
BMC Bioinformatics. 2007 Aug 4;8:286. doi: 10.1186/1471-2105-8-286.
6
Super pairwise alignment (SPA): an efficient approach to global alignment for homologous sequences.超双序列比对(SPA):一种用于同源序列全局比对的高效方法。
J Comput Biol. 2002;9(3):477-86. doi: 10.1089/106652702760138574.
7
rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison.拉斯巴里:优化间隔种子用于数据库搜索、读段映射和无比对序列比较
PLoS Comput Biol. 2016 Oct 19;12(10):e1005107. doi: 10.1371/journal.pcbi.1005107. eCollection 2016 Oct.
8
ABS: Sequence alignment by scanning.ABS:通过扫描进行序列比对。
Annu Int Conf IEEE Eng Med Biol Soc. 2011;2011:928-31. doi: 10.1109/IEMBS.2011.6090209.
9
Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.在流行地区,服用抗叶酸抗疟药物的人群中,叶酸补充剂与疟疾易感性和严重程度的关系。
Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.
10
Robust sequence alignment using evolutionary rates coupled with an amino acid substitution matrix.使用进化速率结合氨基酸替换矩阵进行稳健的序列比对。
BMC Bioinformatics. 2015 Aug 14;16:255. doi: 10.1186/s12859-015-0688-8.

本文引用的文献

1
Using Attribution Sequence Alignment to Interpret Deep Learning Models for miRNA Binding Site Prediction.使用归因序列比对来解释用于miRNA结合位点预测的深度学习模型。
Biology (Basel). 2023 Feb 26;12(3):369. doi: 10.3390/biology12030369.
2
Heuristic Pairwise Alignment in Database Environments.启发式两两比对在数据库环境中的应用。
Genes (Basel). 2022 Nov 2;13(11):2005. doi: 10.3390/genes13112005.
3
Developments in Algorithms for Sequence Alignment: A Review.序列比对算法的发展:综述。
Biomolecules. 2022 Apr 6;12(4):546. doi: 10.3390/biom12040546.
4
SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.SOAPdenovo2:一种经验丰富的、内存效率高的短读长从头组装器。
Gigascience. 2012 Dec 27;1(1):18. doi: 10.1186/2047-217X-1-18.
5
Striped Smith-Waterman speeds database searches six times over other SIMD implementations.条纹史密斯-沃特曼算法在数据库搜索速度上比其他单指令多数据(SIMD)实现快六倍。
Bioinformatics. 2007 Jan 15;23(2):156-61. doi: 10.1093/bioinformatics/btl582. Epub 2006 Nov 16.
6
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.空位BLAST和位置特异性迭代BLAST:新一代蛋白质数据库搜索程序。
Nucleic Acids Res. 1997 Sep 1;25(17):3389-402. doi: 10.1093/nar/25.17.3389.
7
Identification of common molecular subsequences.常见分子子序列的鉴定
J Mol Biol. 1981 Mar 25;147(1):195-7. doi: 10.1016/0022-2836(81)90087-5.
8
A general method applicable to the search for similarities in the amino acid sequence of two proteins.一种适用于寻找两种蛋白质氨基酸序列相似性的通用方法。
J Mol Biol. 1970 Mar;48(3):443-53. doi: 10.1016/0022-2836(70)90057-4.
9
Improved tools for biological sequence comparison.用于生物序列比较的改进工具。
Proc Natl Acad Sci U S A. 1988 Apr;85(8):2444-8. doi: 10.1073/pnas.85.8.2444.
10
Rapid and sensitive protein similarity searches.快速且灵敏的蛋白质相似性搜索。
Science. 1985 Mar 22;227(4693):1435-41. doi: 10.1126/science.2983426.