Shen Wei, Lees John A, Iqbal Zamin
Department of Infectious Diseases, Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing, China.
European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK.
Nat Biotechnol. 2025 Sep 10. doi: 10.1038/s41587-025-02812-8.
The size of microbial sequence databases continues to grow beyond the abilities of existing alignment tools. We introduce LexicMap, a nucleotide sequence alignment tool for efficiently querying moderate-length sequences (>250 bp) such as a gene, plasmid or long read against up to millions of prokaryotic genomes. We construct a small set of probe k-mers, which are selected to efficiently sample the entire database to be indexed such that every 250-bp window of each database genome contains multiple seed k-mers, each with a shared prefix with one of the probes. Storing these seeds in a hierarchical index enables fast and low-memory alignment. We benchmark both accuracy and potential to scale to databases of millions of bacterial genomes, showing that LexicMap achieves comparable accuracy to state-of-the-art methods but with greater speed and lower memory use. Our method supports querying at scale and within minutes, which will be useful for many biological applications across epidemiology, ecology and evolution.
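The seeding idea in the abstract can be illustrated with a toy sketch. It rests on assumptions the abstract does not state: each probe is taken to "capture" the single genome k-mer that shares the longest prefix with it, and a window counts as covered if a seed starts inside it. The function names (`capture_seeds`, `uncovered_windows`), the toy genome and the probes are hypothetical. How LexicMap itself guarantees multiple seeds per 250-bp window, and how its hierarchical index is laid out, are not described here and are not modeled.

```python
from typing import Dict, List, Tuple


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest shared prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def capture_seeds(genome: str, probes: List[str], k: int) -> Dict[str, Tuple[int, str]]:
    """For each probe, keep the genome k-mer sharing the longest prefix with it.

    Returns probe -> (start position, k-mer). Ties go to the leftmost k-mer.
    This is an illustrative assumption, not the published selection rule.
    """
    kmers = [(i, genome[i:i + k]) for i in range(len(genome) - k + 1)]
    seeds = {}
    for probe in probes:
        seeds[probe] = max(kmers, key=lambda pk: common_prefix_len(probe, pk[1]))
    return seeds


def uncovered_windows(seed_positions: List[int], genome_len: int, window: int) -> List[int]:
    """Start positions of windows in which no seed begins."""
    missing = []
    for start in range(genome_len - window + 1):
        if not any(start <= p < start + window for p in seed_positions):
            missing.append(start)
    return missing


# Toy data: a 12-bp "genome", 4-mers, and three hypothetical probes.
genome = "ACGTACGGTTAC"
probes = ["ACGG", "TTAC", "GGGG"]
seeds = capture_seeds(genome, probes, k=4)
positions = [pos for pos, _ in seeds.values()]
gaps = uncovered_windows(positions, len(genome), window=4)
print(seeds)
print(gaps)
```

In this toy case one window contains no seed, which shows why a real tool needs an extra step, or a larger probe set, to meet a per-window guarantee of the kind the abstract describes.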