Uddin Machbah, Islam Mohammad Khairul, Hassan Md Rakib, Jahan Farah, Baek Joong Hwan
Department of Computer Science and Engineering, University of Chittagong, Chittagong, 4331 Bangladesh.
Department of Computer Science and Mathematics, Bangladesh Agricultural University, Mymensingh, 2202 Bangladesh.
Complex Intell Systems. 2023;9(2):1265-1280. doi: 10.1007/s40747-022-00846-y. Epub 2022 Aug 23.
DNA sequence similarity analysis is needed for many purposes, including genome analysis, extracting biological information, and finding the evolutionary relationships of species. There are two types of sequence analysis: alignment-based (AB) and alignment-free (AF). AB is effective for small homologous sequences but becomes an NP-hard problem for long sequences. AF algorithms can overcome the major limitations of AB, but most existing AF methods show high time complexity and memory consumption, lower precision, and weaker performance on benchmark datasets. To reduce these limitations, we develop an AF algorithm that uses a 2D count matrix inspired by the chaos game representation (CGR) approach. We then shrink the matrix by analyzing neighboring cells and measure similarity using the best combinations of pairwise distance (PD) and phylogenetic tree methods. We also dynamically choose the value of k for k-mers, and we develop an efficient system for finding the positions of k-mers in the count matrix. We apply our system to six different datasets. We achieve the top rank on two benchmark datasets from AFproject and 100% accuracy on two datasets (16S ribosomal, 18 Eutherian), and we improve on the time and memory consumption of existing studies on the HEV and HIV-1 datasets. The comparative results on the benchmark datasets and against existing studies demonstrate that our method is effective, efficient, and accurate, and that it can be used reliably for DNA sequence similarity measurement.
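The idea of a CGR-inspired count matrix can be illustrated with a minimal sketch. The abstract does not give the authors' corner layout, shrinking step, rule for choosing k, or distance measure, so everything here is an assumption: the sketch uses one common CGR convention in which each nucleotide contributes one bit to the row and one bit to the column, and a plain Euclidean distance over normalized counts. The names `CORNER_BITS`, `kmer_position`, `count_matrix`, and `euclidean_distance` are hypothetical.

```python
import math

# Assumed corner layout (a common CGR convention, not necessarily the paper's):
# each base maps to (column bit, row bit).
CORNER_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_position(kmer):
    """Return the (row, col) cell of a k-mer in a 2^k x 2^k count matrix."""
    row = col = 0
    for base in kmer:
        col_bit, row_bit = CORNER_BITS[base]
        # Each base appends one bit to both indices, so the cell is found
        # in O(k) time without simulating the chaos game on coordinates.
        col = (col << 1) | col_bit
        row = (row << 1) | row_bit
    return row, col


def count_matrix(seq, k):
    """Count every k-mer of seq into its cell; k-mers with other symbols are skipped."""
    size = 1 << k
    counts = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in CORNER_BITS for base in kmer):
            row, col = kmer_position(kmer)
            counts[row][col] += 1
    return counts


def euclidean_distance(seq_a, seq_b, k):
    """Euclidean distance between the normalized count matrices (an assumed PD choice)."""
    a = count_matrix(seq_a, k)
    b = count_matrix(seq_b, k)
    total_a = sum(map(sum, a)) or 1  # avoid division by zero for very short input
    total_b = sum(map(sum, b)) or 1
    return math.sqrt(sum(
        (x / total_a - y / total_b) ** 2
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    ))
```

Normalizing by the total count is one way to make sequences of different lengths comparable; a pairwise distance matrix built this way could then be passed to a tree-building method such as UPGMA or neighbor joining, which this sketch leaves out.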