University of Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France.
Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i252-i259. doi: 10.1093/bioinformatics/btad225.
The Sequence Read Archive public database has reached 45 petabytes of raw sequences and doubles its nucleotide content every 2 years. Although BLAST-like methods can routinely search for a sequence in a small collection of genomes, making immense public resources searchable is beyond the reach of alignment-based strategies. In recent years, an abundant literature has tackled the task of finding a sequence in extensive sequence collections using k-mer-based strategies. At present, the most scalable methods are approximate membership query data structures, which can query small signatures or variants while scaling to collections of up to 10 000 eukaryotic samples. Results. Here, we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3- to 6-fold improvement in construction time compared to other compressed methods for comparable index size. In favorable instances, a PAC query requires only a single random access and is performed in constant time. Using limited computation resources, we built PAC for very large collections. These include 32 000 human RNA-seq samples, indexed in 5 days, and the entire GenBank bacterial genome collection, indexed in a single day for an index size of 3.5 TB. The latter is, to our knowledge, the largest sequence collection ever indexed using an approximate membership query structure. We also showed PAC's ability to query 500 000 transcript sequences in less than an hour.
PAC's open-source software is available at https://github.com/Malfoy/PAC.
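To make the idea of a k-mer-based approximate membership query concrete, here is a minimal Python sketch. It is not PAC's data structure or algorithm, which the abstract does not describe in detail. It assumes the simplest baseline: one plain Bloom filter per dataset, queried by the share of a sequence's k-mers that each filter reports as present. The names `BloomFilter`, `build_index`, and `query`, and the values of `K`, `n_bits`, `n_hashes`, and `threshold`, are all hypothetical choices made for illustration.

```python
import hashlib

K = 5  # toy k-mer length (assumption; chosen small for the example)


def kmers(seq, k=K):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class BloomFilter:
    """Plain Bloom filter: false positives possible, false negatives not."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        # n_bits is assumed to be a multiple of 8
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        # One salted hash per hash function, reduced to a bit position
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(
            self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item)
        )


def build_index(datasets):
    """Build one filter per dataset from a {name: [sequences]} mapping."""
    index = {}
    for name, seqs in datasets.items():
        bf = BloomFilter()
        for s in seqs:
            for km in kmers(s):
                bf.add(km)
        index[name] = bf
    return index


def query(index, seq, threshold=0.8):
    """Return datasets whose filter contains at least `threshold` of seq's k-mers."""
    q = list(kmers(seq))
    if not q:  # sequence shorter than K: nothing to look up
        return {}
    hits = {}
    for name, bf in index.items():
        share = sum(km in bf for km in q) / len(q)
        if share >= threshold:
            hits[name] = share
    return hits
```

In this sketch a query costs one filter lookup per k-mer per dataset, so its cost grows with the number of indexed datasets. That per-dataset cost is the kind of overhead that more elaborate approximate membership query structures aim to reduce.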