Genetic Epidemiology & Bioinformatics, Faculty of Medicine, University of Southampton, Southampton, UK.
Bioinformatics. 2019 Feb 15;35(4):541-545. doi: 10.1093/bioinformatics/bty687.
Efforts to establish reference genome sequences by de novo sequence assembly have to address the difficulty of linking relatively short sequence contigs to form much larger chromosome assemblies. Efficient strategies are required to span gaps and to establish contig order and relative orientation. We consider here the use of linkage disequilibrium (LD) maps of sequenced contigs and the utility of LD for ordering, orienting and positioning linked sequences. LD maps are readily constructed from population data and have at least an order of magnitude higher resolution than linkage maps, giving them the potential to resolve difficult areas in assemblies. We empirically evaluate an LD map-based method using single nucleotide polymorphism genotype data in a 216 kilobase region of human 6p21.3 from which three shorter contigs are formed.
LD map length is the most informative measure of contig order and orientation: the correct arrangement is suggested by the shortest LD map for which the residual error variance is close to one. For regions in strong LD, this method may be less informative for correcting inverted contigs than for identifying the correct contig order. When two contigs are in linkage disequilibrium with each other, the method can also give a rough estimate of the distance between them.
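The selection rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the LDMAP implementation: `score` is a hypothetical callable that would build an LD map for one candidate arrangement and return its length and residual error variance, `variance_tolerance` is an arbitrary cut-off for "close to one", and a layout and its mirror image (reversed order, flipped strands) are assumed to be equivalent.

```python
from itertools import permutations, product


def arrangements(contigs):
    """Yield each order/orientation once, treating a layout and its mirror as the same."""
    seen = set()
    for order in permutations(contigs):
        for strands in product("+-", repeat=len(contigs)):
            arr = tuple(zip(order, strands))
            # Mirror image: reversed contig order with every strand flipped.
            mirror = tuple((c, "-" if s == "+" else "+") for c, s in reversed(arr))
            if mirror in seen:
                continue
            seen.add(arr)
            yield arr


def pick_arrangement(contigs, score, variance_tolerance=0.5):
    """Return (arrangement, map_length, error_variance) for the preferred layout.

    `score(arr)` is a hypothetical callable returning
    (ld_map_length, residual_error_variance) for one arrangement.
    """
    scored = [(arr, *score(arr)) for arr in arrangements(contigs)]
    # Prefer arrangements whose residual error variance is close to one.
    plausible = [s for s in scored if abs(s[2] - 1.0) <= variance_tolerance]
    # Fall back to all arrangements if none passes the variance filter.
    candidates = plausible or scored
    # Among the candidates, the shortest LD map wins.
    return min(candidates, key=lambda s: s[1])
```

For three contigs this enumerates 24 distinct arrangements (3! orders times 2^3 orientations, halved by the mirror equivalence), so exhaustive scoring is cheap at this scale.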
The LDMAP program is written in C for the Linux platform and is available at https://www.soton.ac.uk/genomicinformatics/research/ld.page.
Supplementary data are available at Bioinformatics online.