Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA.
Stanford Genome Technology Center West, Stanford University, Palo Alto, CA, 94304, USA.
Nucleic Acids Res. 2022 Jul 5;50(W1):W448-W453. doi: 10.1093/nar/gkac266.
K-mers are short DNA sequences used in genome sequence analysis. Applications that use k-mers include genome assembly and alignment. However, wider bioinformatic use of these short sequences is limited by the massive scale of genomic sequence data: a single human genome assembly contains billions of k-mers. As a result, the computational requirements for analyzing k-mer information are enormous, particularly when complete genome assemblies are involved. To address these issues, we developed a new indexing data structure based on a hash table tuned for the lookup of short sequence keys. The resulting web application, KmerKeys, provides rapid query speeds for cloud computation on genome assemblies and supports both fuzzy and exact sequence searches of assemblies. To achieve robust and fast performance, the website implements cache-friendly hash tables, memory mapping and massively parallel processing. Our method employs a scalable and efficient data structure that can be used to jointly index and search a large collection of human genome assembly information. Variant databases and their associated metadata, such as the gnomAD population variant catalogue, can also be included. This feature enables the incorporation of future genomic information into sequencing analysis. KmerKeys is freely accessible at https://kmerkeys.dgi-stanford.org.
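The idea of a hash table keyed on short sequences, queried exactly or with mismatches, can be illustrated with a minimal Python sketch. This is not the KmerKeys implementation, which the abstract does not detail: a plain `dict` stands in for the cache-friendly hash table, the 2-bit packing of bases is a common convention assumed here, "fuzzy" is taken to mean at most one substitution (Hamming distance 1), and all function names are hypothetical.

```python
# Illustrative sketch only: a toy k-mer index, not the KmerKeys data structure.
# Assumptions: 2 bits per base, dict as the hash table, fuzzy = 1 mismatch.

BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a k-mer into an integer key using 2 bits per base."""
    code = 0
    for base in kmer:
        code = (code << 2) | BASE_BITS[base]
    return code


def build_index(sequence, k):
    """Map each encoded k-mer to the list of its 0-based start positions."""
    index = {}
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if any(base not in BASE_BITS for base in kmer):
            continue  # skip k-mers containing N or other ambiguity codes
        index.setdefault(encode_kmer(kmer), []).append(start)
    return index


def exact_search(index, query):
    """Return the positions where the query k-mer occurs exactly."""
    return index.get(encode_kmer(query), [])


def neighbors(query):
    """Yield the query itself and every k-mer one substitution away."""
    yield query
    for i, original in enumerate(query):
        for base in "ACGT":
            if base != original:
                yield query[:i] + base + query[i + 1:]


def fuzzy_search(index, query):
    """Return {matching k-mer: positions} for all hits within 1 mismatch."""
    hits = {}
    for candidate in neighbors(query):
        positions = index.get(encode_kmer(candidate))
        if positions:
            hits[candidate] = positions
    return hits


if __name__ == "__main__":
    index = build_index("ACGTACGTTACG", k=4)
    print(exact_search(index, "ACGT"))  # [0, 4]
    print(fuzzy_search(index, "ACGA"))  # {'ACGT': [0, 4]}
```

Enumerating the 3k one-mismatch neighbors keeps each fuzzy query at a fixed number of hash lookups, which is one simple way to reuse an exact-match table; a production system working at genome scale would need a far more compact, memory-mapped layout than a Python `dict`.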