Optimization and Performance Analysis of CAT Method for DNA Sequence Similarity Searching and Alignment.

Affiliation

Department of Programming and Computer Technologies, Faculty of Computer Systems and Technologies, Technical University of Sofia, 1756 Sofia, Bulgaria.

Publication information

Genes (Basel). 2024 Mar 7;15(3):341. doi: 10.3390/genes15030341.

Abstract

Bioinformatics is a rapidly developing field that enables scientific experiments via computer models and simulations. In recent years, biological databases have grown extraordinarily, so effective methods and algorithms for the fast and accurate processing of biological data are essential. Sequence comparison, in which two or more DNA sequences are aligned to maximize identity and similarity, is the best way to investigate and understand the biological functions of genes and the evolutionary relationships between them. This paper presents a new version of a pairwise DNA sequence alignment algorithm based on a new method called CAT. The new version takes into account a dependency on the previous match and the closest neighbor in order to increase the uniqueness of the CAT profile and to reduce possible collisions, i.e., two or more sequences with the same CAT profile. This makes the proposed algorithm suitable for quickly finding the exact match of a specific DNA sequence in a large set of DNA data. So that the profiles can be used as sequence metadata, CAT profiles are generated once, before the data are uploaded to the database. The proposed algorithm consists of two main stages: CAT profile calculation, which depends on the chosen benchmark sequences, and sequence comparison using the calculated CAT profiles. Improvements in the generation of the CAT profiles are described in detail. Block schemes, pseudocode tables, and figures have been updated to reflect the new version and the experimental results. Experiments were carried out with the new version of the CAT method for DNA sequence alignment on different datasets, and new experimental results regarding the collisions, speed, and efficiency of the new implementation are presented. The performance comparison with Needleman-Wunsch was re-executed with the new version of the algorithm to confirm that performance is unchanged. The proposed CAT-based algorithm was also compared with the Knuth-Morris-Pratt algorithm, which has a complexity of O(n) and is widely used for searching biological data. The impact of prior matching dependencies on the uniqueness of the generated CAT profiles is investigated. The experimental results from sequence alignment demonstrate that the proposed CAT-based algorithm exhibits minimal deviation, which can be deemed negligible if such deviation is acceptable in exchange for enhanced performance. The execution time of the CAT algorithm remains stable and is unaffected by the length of the analyzed sequences. Hence, the primary benefit of the suggested approach is its rapid processing in large-scale sequence alignment, a task for which traditional exact algorithms would require significantly more time.
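
For readers unfamiliar with the Knuth-Morris-Pratt baseline named in the abstract, the following is a minimal textbook sketch of exact pattern search over a DNA string. It is not the authors' code and does not implement the CAT method, whose profile calculation is not specified in the abstract. The function names `build_failure_table` and `kmp_search` are illustrative assumptions.

```python
def build_failure_table(pattern: str) -> list[int]:
    """For each prefix of pattern, length of its longest proper prefix
    that is also a suffix (the classic KMP failure function)."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # Fall back to shorter borders until the next character fits.
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every (possibly overlapping) exact
    occurrence of pattern in text."""
    if not pattern:
        return []
    table = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        # On a mismatch, reuse the failure table instead of rescanning text.
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = table[k - 1]  # keep going to find overlapping matches
    return matches


if __name__ == "__main__":
    # Hypothetical toy sequence, not data from the paper.
    print(kmp_search("ACGTACGTGACG", "ACG"))
```

Because the text index never moves backwards, the scan is linear in the text length once the pattern has been preprocessed, which is the O(n) behavior the abstract refers to.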

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f50a/10970343/b8ca2e16ec7b/genes-15-00341-g001.jpg
