Ryšavý Petr, Železný Filip
Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University in Prague, Prague, Czech Republic.
BioData Min. 2023 Mar 27;16(1):13. doi: 10.1186/s13040-023-00329-x.
Clustering of genetic sequences is a key part of bioinformatics analyses. The resulting phylogenetic trees help answer many research questions, including tracing the history of species, studying past migration, and tracing the source of a virus outbreak. At the same time, biologists increasingly provide data in raw form, as reads, or only as contig-level assemblies. Tools that can process such data without supervision therefore need to be developed.
In this paper, we present a tool for reference-free phylogeny that can handle data for which no mature-level assembly is available. The tool allows distance calculation for raw reads, for contigs, and for a combination of the two. It provides an estimate of the Levenshtein distance between the sequences, which in turn estimates the number of mutations between the organisms. Compared with previous research, the novelty of the method lies in a newly proposed combination of the read and contig measures, a new method for read-contig mapping, and an efficient embedding of contigs.
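For readers unfamiliar with the quantity being estimated, the following is a minimal sketch of the exact Levenshtein (edit) distance between two fully known sequences, computed with the standard dynamic program. It is not the method of the tool, which estimates this distance from reads and contigs without having the assembled sequences; the function name `levenshtein` and the example strings are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Exact edit distance between two strings (textbook dynamic program).

    Illustrative only: the tool described above estimates this value
    from reads and contigs; it does not have the full sequences to compare.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory

    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Two toy sequences that differ by a single deleted base.
    print(levenshtein("ACGT", "AGT"))
```

Each unit of distance corresponds to one insertion, deletion, or substitution, which is why the distance can serve as a proxy for the number of mutations separating two sequences. The exact computation takes time proportional to the product of the sequence lengths, so it is practical only for short, fully assembled inputs.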