Kolesch Fabian, Sohn Marco, Rempel Andreas, Hippel Pia, Wittler Roland
Genome Informatics, Faculty of Technology and Center for Biotechnology, Bielefeld University, 33615, Bielefeld, Germany.
Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University, 33615, Bielefeld, Germany.
BMC Bioinformatics. 2025 Sep 2;26(1):227. doi: 10.1186/s12859-025-06204-2.
The increasing amount of available genome sequence data enables large-scale comparative studies. A common task is the inference of phylogenies, which is challenging if close reference sequences are not available, genome sequences are incompletely assembled, or the high number of genomes precludes multiple sequence alignment in reasonable time. SANS is an alignment-free, whole-genome-based approach for phylogeny estimation.
Here we present a new implementation, SANS ambages, with a significantly broader range of applications. It offers additional types of input data, parallelized processing, and bootstrapping. The source code (C++), documentation, and example data are freely available for download at https://github.com/gi-bielefeld/sans. SANS can also be launched free of charge, with a standard Life Science account, via the web interface of the CloWM platform (https://clowm.bi.denbi.de/workflows/0194b78f-9696-7402-a2b8-858508733618/).
Through parallelization, the new version substantially shortens processing time on large datasets. Support for amino acid sequences and a filter for low-abundance DNA read segments also enable new applications. Bootstrapping and integrated visualization ease and enrich the interpretation of the resulting phylogenies.