Delft Bioinformatics Lab, Delft University of Technology, Van Mourik Broekmanweg 6, Delft, 2628 XE, The Netherlands.
Department of Clinical Genetics, VU University Medical Center, Van der Boechorststraat 7, Amsterdam, 1081 BT, The Netherlands.
Genome Biol. 2020 Mar 11;21(1):65. doi: 10.1186/s13059-020-01963-y.
The practical use of graph-based reference genomes depends on the ability to align reads to them. Substring queries on paths through these graphs lie at the core of this task. The combination of increasing pattern length and encoded variation inevitably leads to a combinatorial explosion of the search space. Instead of relying on heuristic filtering or pruning steps to reduce the complexity, we propose CHOP, a method that constrains the search space by exploiting haplotype information. CHOP bounds the search space by the number of haplotypes, so that a combinatorial explosion is prevented. We show that CHOP can be applied to large and complex datasets by applying it to a graph-based representation of the human genome that encodes all 80 million variants reported by the 1000 Genomes Project.
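The difference between an unconstrained and a haplotype-constrained search space can be illustrated with a toy example. The sketch below is not CHOP's implementation. It assumes a simplified model in which the graph is a linear chain of sites, each site holds one or more alleles, and a haplotype is a tuple of allele indices. The names `sites`, `haplotypes`, `all_path_kmers`, and `haplotype_kmers` are hypothetical and chosen for illustration only.

```python
from itertools import product


def kmers_of(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def all_path_kmers(sites, k):
    """k-mers spelled by every path through the chain of sites.

    The number of paths is the product of the allele counts per site,
    so it grows exponentially with the number of variant sites.
    """
    kmers = set()
    for choice in product(*sites):
        kmers |= kmers_of("".join(choice), k)
    return kmers


def haplotype_kmers(sites, haplotypes, k):
    """k-mers spelled only by paths that an observed haplotype follows.

    The number of sequences visited is at most the number of haplotypes.
    """
    kmers = set()
    for hap in haplotypes:
        seq = "".join(site[allele] for site, allele in zip(sites, hap))
        kmers |= kmers_of(seq, k)
    return kmers


# Toy data (assumed): invariant segments have one allele, SNP sites have two.
sites = [["AC"], ["G", "T"], ["TA"], ["C", "A"], ["GG"], ["T", "C"], ["AT"]]
# Two observed haplotypes, given as one allele index per site.
haplotypes = [(0, 0, 0, 0, 0, 0, 0), (0, 1, 0, 1, 0, 1, 0)]

unconstrained = all_path_kmers(sites, k=8)
constrained = haplotype_kmers(sites, haplotypes, k=8)
print(len(unconstrained), len(constrained))
```

With three biallelic sites the chain already has eight paths, while only two of them are supported by the haplotypes. Every haplotype-supported k-mer is also a k-mer of the unconstrained graph, so the constrained set is a subset whose size is bounded by the number of haplotypes rather than by the number of variant combinations.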