A novel alignment-free DNA sequence similarity analysis approach based on top-k n-gram match-up.

Author information

Delibaş Emre, Arslan Ahmet, Şeker Abdulkadir, Diri Banu

Affiliations

Department of Computer Engineering, Faculty of Engineering, Sivas Cumhuriyet University, 58140, Sivas, Turkey.

Department of Computer Engineering, Faculty of Engineering, Selçuk University, 42250, Konya, Turkey.

Publication information

J Mol Graph Model. 2020 Nov;100:107693. doi: 10.1016/j.jmgm.2020.107693. Epub 2020 Aug 7.

DOI:10.1016/j.jmgm.2020.107693

PMID:32805559

Abstract

DNA sequence similarity analysis is an essential task in computational biology and bioinformatics. Nearly all research on evolutionary relationships, gene function analysis, protein structure prediction and sequence retrieval requires similarity calculations. As an alternative to alignment-based sequence comparison methods, which incur high computational cost, alignment-free methods have emerged that calculate similarity by digitizing the sequence in a different space. In this paper, we propose an alignment-free DNA sequence similarity analysis method based on top-k n-gram matches, on the hypothesis that common repeating DNA subsections indicate high similarity between DNA sequences. In our method, we determine DNA sequence similarities by measuring similarity among feature vectors created according to top-k n-gram match-up scores, without the use of similarity functions. We applied the similarity calculation to three DNA data sets of different lengths. The phylogenetic trees produced by our method coincide almost completely with the results of the MEGA software, which is based on sequence alignment. Our findings show that a certain number of frequently recurring common sequence patterns have the power to characterize DNA sequences.
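
The abstract does not spell out how the top-k n-gram match-up scores are computed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that the n-grams are overlapping substrings of length n, that "top-k" means the k most frequent n-grams in a sequence, and that a match-up score can be approximated by the fraction of top-k n-grams two sequences share. The function names `top_k_ngrams` and `match_score` and the default values of `n` and `k` are hypothetical.

```python
from collections import Counter


def top_k_ngrams(seq, n, k):
    """Return the k most frequent overlapping n-grams of seq, most frequent first."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:k]]


def match_score(seq_a, seq_b, n=3, k=5):
    """Fraction of top-k n-grams shared by two sequences.

    This scoring rule is an assumption made for illustration; the
    abstract does not define the actual match-up score.
    """
    top_a = set(top_k_ngrams(seq_a, n, k))
    top_b = set(top_k_ngrams(seq_b, n, k))
    if not top_a or not top_b:
        # A sequence shorter than n has no n-grams to compare.
        return 0.0
    return len(top_a & top_b) / min(len(top_a), len(top_b))
```

Under these assumptions, computing `match_score` for every pair of sequences in a data set would give a score matrix whose rows could serve as per-sequence feature vectors. How the paper turns such scores into a phylogenetic tree is not stated in the abstract.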