Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA.
Genome Biol. 2021 Aug 19;22(1):231. doi: 10.1186/s13059-021-02442-8.
Efficiently scaling genomic variant search indexes to thousands of samples is computationally challenging because avoiding reference bias requires supporting multiple coordinate systems. We present VariantStore, a system that indexes genomic variants from multiple samples using a variation graph and supports variant queries in any sample-specific coordinate system. We demonstrate the scalability of VariantStore by indexing genomic variants from the TCGA project in 4 h and from the 1000 Genomes Project in 3 h. Querying for variants in a gene takes between 0.002 and 3 s and uses memory equal to only 10% of the size of the full representation.
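To make the idea of a sample-specific coordinate system concrete, the following minimal sketch maps a reference position to the corresponding position in one sample's sequence by accumulating the length changes introduced by that sample's insertions and deletions. It is an illustration only and not VariantStore's implementation, which indexes variants in a variation graph; the names `Variant`, `build_offsets`, and `ref_to_sample` are hypothetical, positions are assumed to be 0-based, and variants are assumed not to overlap.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    ref_pos: int  # 0-based reference position where the variant starts
    ref: str      # reference allele
    alt: str      # alternate allele carried by the sample


def build_offsets(variants):
    """Return (ref_end, cumulative_shift) breakpoints sorted by position.

    ref_end is the first reference position after the variant, and
    cumulative_shift is the net length change up to and including it.
    Assumes the variants of a sample do not overlap.
    """
    breakpoints = []
    shift = 0
    for v in sorted(variants, key=lambda v: v.ref_pos):
        shift += len(v.alt) - len(v.ref)
        breakpoints.append((v.ref_pos + len(v.ref), shift))
    return breakpoints


def ref_to_sample(ref_pos, breakpoints):
    """Map a reference position to the sample's coordinate.

    Only meaningful for positions outside a variant's reference allele
    (or at its first base); positions inside a deletion are not handled.
    """
    ends = [end for end, _ in breakpoints]
    i = bisect_right(ends, ref_pos)  # number of variants ending at or before ref_pos
    shift = breakpoints[i - 1][1] if i else 0
    return ref_pos + shift


# Hypothetical sample: a 2-base insertion at 10 and a 2-base deletion at 20.
sample_variants = [Variant(10, "A", "ATT"), Variant(20, "GCC", "G")]
offsets = build_offsets(sample_variants)
```

With these two variants, positions before the insertion are unchanged, positions between the two variants shift right by two bases, and positions after the deletion line up with the reference again. A query expressed in one sample's coordinates would apply the inverse mapping before consulting a shared index.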