Hwang Stephen, Brown Nathaniel K, Ahmed Omar Y, Jenike Katharine M, Kovaka Sam, Schatz Michael C, Langmead Ben
XDBio Program, Johns Hopkins University, Baltimore MD, USA.
Department of Computer Science, Johns Hopkins University, Baltimore MD, USA.
bioRxiv. 2024 May 22:2024.05.20.595044. doi: 10.1101/2024.05.20.595044.
Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO supports both queries that test k-mer presence/absence (membership queries) and queries that count the number of genomes containing k-mers in a window (conservation queries). MEMO's index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8× smaller than a comparable KMC3 index and 11.4× smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 seconds, 2.5× faster than other approaches. MEMO's small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.
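The link between MEMs and k-mer queries can be illustrated with a small sketch. It is not MEMO's implementation or file format. It relies only on the observation that a k-mer of a reference sequence occurs in another genome exactly when it lies inside some maximal exact match between the two, so intervals computed once can answer queries for any k. The function names, the interval representation (half-open reference coordinates per genome), and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed representation: half-open [start, end) MEM interval on the reference.
Interval = Tuple[int, int]


def kmer_in_genome(mems: List[Interval], pos: int, k: int) -> bool:
    """True if reference[pos:pos+k] lies entirely inside at least one MEM."""
    return any(start <= pos and pos + k <= end for start, end in mems)


def membership(mems: List[Interval], window: Interval, k: int) -> List[bool]:
    """Presence/absence in one genome for every k-mer starting in the window."""
    w_start, w_end = window
    return [kmer_in_genome(mems, pos, k) for pos in range(w_start, w_end - k + 1)]


def conservation(
    mems_by_genome: Dict[str, List[Interval]], window: Interval, k: int
) -> List[int]:
    """Number of genomes containing each k-mer starting in the window."""
    w_start, w_end = window
    return [
        sum(kmer_in_genome(mems, pos, k) for mems in mems_by_genome.values())
        for pos in range(w_start, w_end - k + 1)
    ]


# Hypothetical toy data: MEMs between a 10-base reference and three genomes.
mems_by_genome = {
    "g1": [(0, 10)],          # identical to the reference
    "g2": [(0, 4), (3, 10)],  # one mismatch breaks the match near position 3
    "g3": [(5, 9)],           # shares only a short stretch
}

# The same intervals answer queries at different k without re-indexing.
print(conservation(mems_by_genome, (0, 10), 3))  # [2, 2, 1, 2, 2, 3, 3, 2]
print(conservation(mems_by_genome, (0, 10), 5))  # [1, 1, 1, 2, 2, 2]
```

This linear scan over intervals is for clarity only; a practical index would keep the intervals ordered so that each window can be answered without touching every MEM.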