Fully-sensitive seed finding in sequence graphs using a hybrid index.

Affiliations

Center for Bioinformatics, Saarland University, Saarbrücken, Germany.

Max Planck Institute for Informatics, Saarbrücken, Germany.

Publication information

Bioinformatics. 2019 Jul 15;35(14):i81-i89. doi: 10.1093/bioinformatics/btz341.

Abstract

MOTIVATION

Sequence graphs are versatile data structures that can, for instance, represent the genetic variation found in a population and facilitate genome assembly. Read mapping to sequence graphs is an important step in many applications and is usually done by first finding exact seed matches, which are then extended by alignment. Existing methods for finding seed hits prune the graph in complex regions, which loses information, especially in highly polymorphic regions of the genome. While such complex graph structures can indeed lead to a combinatorial explosion of possible alleles, the query set of reads from a diploid individual realizes only two alleles per locus, a property that is not exploited by extant methods.
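To make the combinatorial explosion concrete, the following is a minimal sketch and is not taken from the paper. It assumes a toy sequence graph made of a chain of biallelic SNP "bubbles"; the representation and the names `build_bubble_chain` and `count_walks` are hypothetical and chosen only for illustration. It counts the distinct source-to-sink walks, each of which spells one possible haplotype.

```python
from typing import Dict, List, Tuple

# Toy graph: node id -> (label, successor ids). This layout is an assumption
# made for illustration, not the representation used by any real tool.
Graph = Dict[int, Tuple[str, List[int]]]


def build_bubble_chain(n_sites: int) -> Graph:
    """Build a chain of n_sites biallelic SNP bubbles.

    Site i has an anchor node 3*i followed by two alternative allele
    nodes 3*i+1 and 3*i+2, which both lead to the next anchor.
    """
    graph: Graph = {}
    for i in range(n_sites):
        anchor, allele_a, allele_b = 3 * i, 3 * i + 1, 3 * i + 2
        next_anchor = 3 * (i + 1)
        graph[anchor] = ("A", [allele_a, allele_b])
        graph[allele_a] = ("C", [next_anchor])
        graph[allele_b] = ("G", [next_anchor])
    graph[3 * n_sites] = ("A", [])  # final anchor, the sink
    return graph


def count_walks(graph: Graph, source: int, sink: int) -> int:
    """Count distinct walks from source to sink in an acyclic graph."""
    memo: Dict[int, int] = {}

    def visit(node: int) -> int:
        if node == sink:
            return 1
        if node not in memo:
            memo[node] = sum(visit(nxt) for nxt in graph[node][1])
        return memo[node]

    return visit(source)


# 20 biallelic sites give 2**20 = 1048576 possible haplotypes.
print(count_walks(build_bubble_chain(20), 0, 60))
```

In this toy setting the number of haplotypes doubles with every site, whereas the reads of one diploid sample can support at most two of them at any locus.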

RESULTS

We present the Pan-genome Seed Index (PSI), a fully-sensitive hybrid method for seed finding, which takes full advantage of this property by combining an index over selected paths in the graph with an index over the query reads. This enables PSI to find all seeds while eliminating the need to prune the graph. We demonstrate its performance with different parameter settings on both simulated data and a whole human genome graph constructed from variants in the 1000 Genomes Project dataset. On this graph, PSI outperforms GCSA2 in terms of index size, query time and sensitivity.
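The hybrid idea can be illustrated with a small sketch. This is not PSI's actual algorithm or data layout: it assumes one character per node, treats every overlapping k-mer of a read as a seed, and uses plain Python dictionaries in place of real index structures. The function `find_seeds` and all of its parameters are hypothetical. Seeds are first looked up in an index over the selected paths. The graph is then traversed, and the read index is used to abandon any walk whose spelled prefix occurs in no read, so walks off the selected paths are still found without pruning the graph.

```python
from collections import defaultdict


def find_seeds(labels, edges, paths, reads, k):
    """Return all exact k-mer seed hits as (read id, read offset, node walk).

    labels: node id -> single character (toy assumption)
    edges:  node id -> list of successor node ids
    paths:  selected paths, each a list of node ids
    reads:  list of query strings
    """
    # Index over the query reads: k-mer -> occurrences in the reads.
    read_index = defaultdict(list)
    for rid, read in enumerate(reads):
        for off in range(len(read) - k + 1):
            read_index[read[off:off + k]].append((rid, off))
    # All prefixes of read k-mers, used to cut off hopeless graph walks.
    prefixes = {kmer[:i] for kmer in read_index for i in range(1, k + 1)}

    # Index over the selected paths: k-mer -> node walks spelling it.
    path_index = defaultdict(set)
    for path in paths:
        for i in range(len(path) - k + 1):
            walk = tuple(path[i:i + k])
            path_index["".join(labels[v] for v in walk)].add(walk)
    covered = {walk for walks in path_index.values() for walk in walks}

    hits = set()
    # Step 1: seeds that lie on a selected path are found by lookup.
    for kmer, occurrences in read_index.items():
        for walk in path_index.get(kmer, ()):
            for rid, off in occurrences:
                hits.add((rid, off, walk))

    # Step 2: traverse the graph for walks not on any selected path,
    # guided by the read index so that no graph pruning is required.
    def extend(walk, spelled):
        if spelled not in prefixes:
            return  # no read k-mer starts like this, stop early
        if len(walk) == k:
            if walk not in covered:  # covered walks were handled in step 1
                for rid, off in read_index[spelled]:
                    hits.add((rid, off, walk))
            return
        for nxt in edges.get(walk[-1], ()):
            extend(walk + (nxt,), spelled + labels[nxt])

    for node in labels:
        extend((node,), labels[node])
    return hits


# A single SNP bubble: A -> (C | G) -> T, with the reference path selected.
labels = {0: "A", 1: "C", 2: "G", 3: "T"}
edges = {0: [1, 2], 1: [3], 2: [3]}
seeds = find_seeds(labels, edges, paths=[[0, 1, 3]], reads=["ACT", "AGT"], k=3)
print(sorted(seeds))
```

In this example the read `ACT` is matched through the path index, while `AGT` is recovered by the read-guided traversal through the alternative allele node, which a path-only index would miss.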

AVAILABILITY AND IMPLEMENTATION

The C++ implementation is publicly available at: https://github.com/cartoonist/psi.
