United States Department of Agriculture-Agricultural Research Service, Robert W. Holley Center, Ithaca, NY 14853, USA.
Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA.
Bioinformatics. 2022 Aug 2;38(15):3698-3702. doi: 10.1093/bioinformatics/btac410.
Pangenomes provide novel insights for population and quantitative genetics, genomics and breeding that are not available from studying a single reference genome: a species is better represented by a pangenome, or collection of genomes, than by any one reference. Unfortunately, managing and using pangenomes for genomically diverse species is computationally and practically challenging. We developed a trellis graph representation, anchored to the reference genome, that represents most pangenomes well and can be used to impute complete genomes from low-density sequence or variant data.
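To make the idea of a reference-anchored trellis concrete, the following is a minimal sketch, not the PHG data model itself. It assumes the genome is divided into ordered reference ranges (columns), that each column holds one node per distinct haplotype, and that each node records which lines (taxa) carry it. The class names, field names and toy sequences are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ReferenceRange:
    """An interval on the reference genome that anchors one trellis column."""
    chrom: str
    start: int  # 1-based, inclusive (an assumption of this sketch)
    end: int


@dataclass(frozen=True)
class HaplotypeNode:
    """One distinct haplotype within a reference range."""
    hap_id: int
    taxa: frozenset  # names of the lines that carry this haplotype
    sequence: str


@dataclass
class TrellisGraph:
    """Columns of haplotype nodes keyed by reference range."""
    columns: dict = field(default_factory=dict)

    def add_range(self, ref_range, haplotypes):
        self.columns[ref_range] = list(haplotypes)

    def ordered_ranges(self):
        # Columns are ordered by their position on the reference genome.
        return sorted(self.columns, key=lambda r: (r.chrom, r.start))

    def path_for_taxon(self, taxon):
        # A genome is a path: one node per column (None if the taxon is absent).
        path = []
        for ref_range in self.ordered_ranges():
            node = next((n for n in self.columns[ref_range] if taxon in n.taxa), None)
            path.append(node)
        return path

    def sequence_for_path(self, path):
        # Concatenating node sequences along a path reconstructs a genome.
        return "".join(node.sequence for node in path if node is not None)


# Toy data: three hypothetical lines, two reference ranges.
graph = TrellisGraph()
graph.add_range(ReferenceRange("1", 1, 4), [
    HaplotypeNode(1, frozenset({"lineA", "lineB"}), "ACGT"),
    HaplotypeNode(2, frozenset({"lineC"}), "ACTT"),
])
graph.add_range(ReferenceRange("1", 5, 8), [
    HaplotypeNode(3, frozenset({"lineA"}), "GGCA"),
    HaplotypeNode(4, frozenset({"lineB", "lineC"}), "GGTA"),
])

print([node.hap_id for node in graph.path_for_taxon("lineB")])
print(graph.sequence_for_path(graph.path_for_taxon("lineC")))
```

In this sketch, lines that share a haplotype in a range point at the same node, so the sequence is stored once. That sharing is one plausible way a structure of this kind could save space; it is offered as an illustration rather than a description of how the PHG stores data.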
The Practical Haplotype Graph (PHG) is a pangenome pipeline, database (PostgreSQL or SQLite), data model (Java, Kotlin or R) and Breeding API (BrAPI) web service. The PHG has already been used to accurately represent diversity in four major crops, including maize, one of the most genomically diverse species, with up to 1000-fold data compression. Using simulated data, we show that, even at 0.1× coverage, with appropriate reads and sequence alignment, imputation results in extremely accurate haplotype reconstruction. The PHG is a platform and environment for the understanding and application of genomic diversity.
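One standard way to pick a path through a trellis when most positions have no data is the Viterbi algorithm over a hidden Markov model. The sketch below illustrates that general technique and is not taken from the PHG source code. It assumes each state is a hypothetical parental line, each observation is the set of haplotype IDs that reads hit in a reference range (empty when low coverage leaves the range unobserved), and the parameters `p_switch` and `p_err` are illustrative values rather than settings from the paper.

```python
import math


def viterbi_impute(hap_by_taxon, read_hits, p_switch=0.01, p_err=0.01):
    """Return the most likely haplotype ID per reference range.

    hap_by_taxon: {taxon: [hap_id for each range]}  (hypothetical input layout)
    read_hits:    [set of hap_ids hit by reads in each range]; empty set = no reads
    """
    taxa = sorted(hap_by_taxon)
    stay = math.log(1 - p_switch)
    switch = math.log(p_switch / max(len(taxa) - 1, 1))

    def emit(taxon, i):
        hits = read_hits[i]
        if not hits:
            return 0.0  # unobserved range: every state is equally likely
        matched = hap_by_taxon[taxon][i] in hits
        return math.log(1 - p_err) if matched else math.log(p_err)

    def step(prev, cur):
        return stay if prev == cur else switch

    # Forward pass in log space, keeping back-pointers for each range.
    score = {t: emit(t, 0) for t in taxa}
    back = []
    for i in range(1, len(read_hits)):
        new_score, pointers = {}, {}
        for t in taxa:
            best_prev = max(taxa, key=lambda p: score[p] + step(p, t))
            new_score[t] = score[best_prev] + step(best_prev, t) + emit(t, i)
            pointers[t] = best_prev
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(taxa, key=lambda t: score[t])
    states = [state]
    for pointers in reversed(back):
        state = pointers[state]
        states.append(state)
    states.reverse()
    return [hap_by_taxon[t][i] for i, t in enumerate(states)]


# Toy example: reads hit only the first and last of four ranges.
haplotypes = {"lineA": [1, 3, 5, 7], "lineB": [2, 4, 6, 8]}
sparse_hits = [{1}, set(), set(), {7}]
print(viterbi_impute(haplotypes, sparse_hits))
```

Because switching between lines is penalized, the two unobserved ranges in the toy example are filled in from the line supported by the flanking reads. This is the intuition behind imputing a complete path from sparse data.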
All resources listed here are freely available. The PHG Docker image used to generate the simulation results is available at https://hub.docker.com/ as maizegenetics/phg:0.0.27. PHG source code is at https://bitbucket.org/bucklerlab/practicalhaplotypegraph/src/master/. The code used for the analysis of simulated data is at https://bitbucket.org/bucklerlab/phg-manuscript/src/master/. The PHG database of NAM parent haplotypes is in the CyVerse data store (https://de.cyverse.org/de/) and is named /iplant/home/shared/panzea/panGenome/PHG_db_maize/phg_v5Assemblies_20200608.db.
Supplementary data are available at Bioinformatics online.