Department of Data Sciences, Dana-Farber Cancer Institute, Boston, 02215, MA, USA.
Department of Biomedical Informatics, Harvard Medical School, Boston, 02215, MA, USA.
Genome Biol. 2020 Oct 16;21(1):265. doi: 10.1186/s13059-020-02168-z.
Recent advances in sequencing technologies enable the assembly of individual genomes to the quality of the reference genome. How to integrate multiple genomes from the same species and make the integrated representation accessible to biologists remains an open challenge. Here, we propose a graph-based data model and associated formats to represent multiple genomes while preserving the coordinate system of the linear reference genome. We implement our ideas in the minigraph toolkit and demonstrate that we can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome.