Patsakis Michail, Provatas Kimonas, Mouratidis Ioannis, Georgakopoulos-Soares Ilias
Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA.
Huck Institute of the Life Sciences, Pennsylvania State University, University Park, PA, USA.
ArXiv. 2024 Nov 29:arXiv:2411.19427v1.
With the rapid expansion of large-scale biological datasets, DNA and protein sequence alignments have become essential for comparative genomics and proteomics. These alignments facilitate the exploration of sequence similarity patterns, providing valuable insights into sequence conservation and evolutionary relationships and supporting functional analyses. Typically, sequence alignments are stored in formats such as the Multiple Alignment Format (MAF). Counting k-mer occurrences is a crucial task in many computational biology applications, but currently, there is no algorithm designed for k-mer counting in alignment files.
We have developed MAFcounter, the first k-mer counter dedicated to alignment files. MAFcounter is multithreaded, fast, and memory efficient, enabling k-mer counting in DNA and protein sequence alignment files.
The MAFcounter package and its Python bindings are released under the GPL license as a multi-platform application and are available at: https://github.com/Georgakopoulos-Soares-lab/MAFcounter.
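To make the task concrete, the following is a minimal single-threaded sketch of k-mer counting over MAF text. It is not MAFcounter's implementation and does not use its Python bindings; the function name `count_kmers_in_maf` and the example records are hypothetical. It assumes that gap characters (`-`) are removed before k-mers are extracted, so a k-mer may span a gap in the alignment, and that every sequence row of every block contributes to the counts.

```python
from collections import Counter


def count_kmers_in_maf(lines, k):
    """Count k-mers over the sequence rows ("s" lines) of MAF text.

    Illustrative sketch only: gaps are stripped first, so k-mers are
    taken from the ungapped sequence of each aligned row.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        # A MAF sequence row has 7 fields:
        # s  src  start  size  strand  srcSize  text
        if len(fields) != 7 or fields[0] != "s":
            continue
        seq = fields[6].replace("-", "").upper()
        # Slide a window of width k across the ungapped sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Hypothetical two-row alignment block used only for illustration.
example = """##maf version=1
a score=100
s hg38.chr1 0 6 + 100 ACG-TAC
s mm10.chr1 5 5 + 90 ACGTT--
""".splitlines()

counts = count_kmers_in_maf(example, 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

A dedicated tool would additionally need to handle files too large for an in-memory dictionary, parallelize the work across threads, and decide how to treat ambiguous symbols such as `N`; none of that is attempted here.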