Efficient TF-IDF method for alignment-free DNA sequence similarity analysis.

Author information

Delibaş Emre

Affiliation

Department of Computer Engineering, Faculty of Engineering, Sivas Cumhuriyet University, 58140, Sivas, Turkey.

Publication information

J Mol Graph Model. 2025 Jun;137:109011. doi: 10.1016/j.jmgm.2025.109011. Epub 2025 Mar 15.

DOI:10.1016/j.jmgm.2025.109011

PMID:40107030

Abstract

This study proposes an alignment-free approach to DNA sequence similarity analysis. DNA sequences are represented as n-grams, and the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm is adapted to genomic data to weight them. By identifying the most informative n-grams, the approach aims to improve accuracy while reducing computational cost. It circumvents limitations of both traditional alignment-based methods and existing alignment-free methods. The proposed method was tested on three different datasets and achieved high agreement with reference phylogenetic trees in the AFProject benchmark system. The results indicate that TF-IDF-based similarity matrices capture phylogenetic relationships effectively and significantly reduce processing time, suggesting that the method offers a scalable and robust alternative for large genomic datasets.
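The general idea of TF-IDF weighting over DNA n-grams can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not state the n-gram length, the TF-IDF variant, the similarity measure, or how the most informative n-grams are selected. The choices below (overlapping 3-grams, a smoothed IDF, cosine similarity, and the function names `kmers`, `tfidf_vectors`, `cosine`, `similarity_matrix`) are assumptions made only for illustration, and no n-gram selection step is included.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return the overlapping n-grams (k-mers) of a DNA sequence."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf_vectors(seqs, k=3):
    """Build one sparse TF-IDF vector (dict: n-gram -> weight) per sequence.

    Assumption: TF is the relative n-gram frequency within a sequence and
    IDF is the smoothed form ln((1 + N) / (1 + df)) + 1.
    """
    counts = [Counter(kmers(s, k)) for s in seqs]
    n_seqs = len(seqs)

    # Document frequency: number of sequences containing each n-gram.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    idf = {g: math.log((1 + n_seqs) / (1 + d)) + 1.0 for g, d in df.items()}

    vectors = []
    for c in counts:
        total = sum(c.values())
        if total == 0:
            vectors.append({})  # sequence shorter than k
        else:
            vectors.append({g: (cnt / total) * idf[g] for g, cnt in c.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(seqs, k=3):
    """Pairwise cosine similarities of the TF-IDF vectors."""
    vecs = tfidf_vectors(seqs, k)
    return [[cosine(a, b) for b in vecs] for a in vecs]


if __name__ == "__main__":
    # Toy sequences, invented for this example only.
    toy = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
    for row in similarity_matrix(toy, k=3):
        print(["%.3f" % x for x in row])
```

A matrix of this kind could be converted to distances (for example, one minus the similarity) before being passed to a tree-building method, though the abstract does not say which conversion or tree method the study uses.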
