State Key Laboratory of Earth Surface Processes and Resource Ecology, Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China.
BMC Bioinformatics. 2021 May 27;22(1):282. doi: 10.1186/s12859-021-04149-w.
With the rapid development of accurate sequencing and assembly technologies, a growing number of high-quality chromosome-level and haplotype-resolved genome assemblies have become available, creating great opportunities for computational pangenomics. Although genome graphs are among the most useful models for pangenome representation, their structural complexity makes it difficult to present genomic information as intuitively as a linear reference genome does. Efficiently and accurately analyzing the spatial structure of a genome graph and coordinating its information therefore remains a substantial challenge.
We developed a new method, the colored superbubble (cSupB), that overcomes the complexity of graphs and organizes a set of species- or population-specific haplotype sequences of interest. Based on this model, we propose a tri-tuple coordinate system that combines an offset value, topological structure and sample information. In addition, cSupB provides a novel method that uses complete topological information and efficiently detects small indels (< 50 bp) in highly similar samples, which can be validated with simulated datasets. Moreover, we demonstrated that cSupB can adapt to complex cyclic structures.
Although relaxing the directed-acyclic-graph constraint makes the solution suitable for increasingly complex genome graphs, the cSupB motif and the cSupB method can be extended to any colored directed acyclic graph. We anticipate that our method will facilitate the analysis of individual haplotype variants and population genomic diversity. We have developed a C++ program implementing our method, available at https://github.com/eggleader/cSupB.
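To make the tri-tuple coordinate idea concrete, the following is a minimal C++ sketch of a coordinate that combines topological structure, an offset value and sample information. It is an illustration only and is not taken from the cSupB program: the type name `TriTupleCoord`, its field names, the bitmask encoding of samples (limited here to 64 samples) and the helper functions are all assumptions, and the actual representation and component order used by cSupB may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical tri-tuple coordinate; names and layout are illustrative
// assumptions, not the data structure used by the cSupB program.
struct TriTupleCoord {
    std::uint32_t bubble_id;    // topological structure: index of the enclosing superbubble
    std::uint64_t offset;       // offset value: base position within that structure
    std::uint64_t sample_mask;  // sample information: bit i is set if sample (color) i carries the path
};

// Returns true if the sample with the given index is among the colors
// recorded on this coordinate.
inline bool has_sample(const TriTupleCoord& c, unsigned sample_index) {
    assert(sample_index < 64);  // this sketch supports at most 64 samples
    return ((c.sample_mask >> sample_index) & 1u) != 0;
}

// Returns the bitmask of samples present on both coordinates.
inline std::uint64_t shared_samples(const TriTupleCoord& a, const TriTupleCoord& b) {
    return a.sample_mask & b.sample_mask;
}

// Formats the coordinate as "(bubble, offset, {sample indices})".
inline std::string format_coord(const TriTupleCoord& c) {
    std::string out = "(" + std::to_string(c.bubble_id) + ", " +
                      std::to_string(c.offset) + ", {";
    bool first = true;
    for (unsigned i = 0; i < 64; ++i) {
        if ((c.sample_mask >> i) & 1u) {
            if (!first) out += ",";
            out += std::to_string(i);
            first = false;
        }
    }
    out += "})";
    return out;
}
```

For example, under these assumptions `TriTupleCoord{3, 120, 0x5}` would denote offset 120 inside superbubble 3 on a path carried by samples 0 and 2. A bitmask keeps sample-set intersection to a single bitwise operation, which is one reasonable choice when comparing highly similar samples; a larger cohort would need a wider bitset.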