Department of Statistics.
Department of Mathematics Education, Seoul National University, Seoul 08826, South Korea.
Bioinformatics. 2018 Feb 1;34(3):388-397. doi: 10.1093/bioinformatics/btx609.
Linkage disequilibrium (LD) block construction is required for research in population genetics and genetic epidemiology, including specification of sets of single nucleotide polymorphisms (SNPs) for multi-SNP based association analysis and identification of haplotype blocks in high-density sequencing data. Existing methods based on a narrow-sense definition do not allow intermediate regions of low LD between strongly associated SNP pairs and tend to split high-density SNP data into small blocks having high between-block correlation.
We present Big-LD, a block partition method based on interval graph modeling of LD bins, which are clusters of SNPs in strong pairwise LD that are not necessarily physically consecutive. Big-LD uses an agglomerative approach that starts by identifying small communities of SNPs, i.e. the SNPs in each LD bin region, and proceeds by merging these communities. We determine the number of blocks using a method that finds a maximum-weight independent set. Big-LD produces larger LD blocks than existing methods such as MATILDE, Haploview, MIG++, or S-MIG++, and its LD blocks agree better with recombination hotspot locations determined by sperm-typing experiments. The observed average runtime of Big-LD for 13 288 240 non-monomorphic SNPs from 1000 Genomes Project autosome data (286 East Asians) is about 5.83 h, which is a significant improvement over the existing methods.
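To make the maximum-weight independent set step more concrete, here is a minimal sketch of how such a set can be found on an interval graph with the textbook dynamic program for weighted interval scheduling. It is an illustration only, not the Big-LD implementation: the function name, the representation of a candidate region as a `(start, end, weight)` tuple, and the example weights are all assumptions made for this sketch.

```python
from bisect import bisect_left


def max_weight_independent_intervals(intervals):
    """Pick non-overlapping intervals with maximum total weight.

    `intervals` is a list of (start, end, weight) tuples with closed
    endpoints (hypothetical layout, e.g. SNP index ranges of candidate
    regions). Returns (total_weight, chosen_intervals).
    """
    # Sort by right endpoint so every prefix is a valid subproblem.
    ivs = sorted(intervals, key=lambda iv: iv[1])
    ends = [iv[1] for iv in ivs]
    n = len(ivs)

    best = [0] * (n + 1)   # best[k]: optimum using the first k intervals
    take = [False] * n     # whether interval i is taken in best[i + 1]
    prev = [0] * n         # number of intervals ending strictly before start of i

    for i, (start, _end, weight) in enumerate(ivs):
        # Closed intervals: sharing an endpoint counts as an overlap,
        # so only intervals with end < start are compatible.
        p = bisect_left(ends, start)
        prev[i] = p
        if best[p] + weight > best[i]:
            best[i + 1] = best[p] + weight
            take[i] = True
        else:
            best[i + 1] = best[i]

    # Walk back through the table to recover the chosen intervals.
    chosen = []
    i = n
    while i > 0:
        if take[i - 1]:
            chosen.append(ivs[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    chosen.reverse()
    return best[n], chosen


# Hypothetical candidate regions: (first SNP index, last SNP index, weight).
candidates = [(1, 5, 4), (4, 9, 6), (6, 12, 5), (13, 20, 3)]
total, blocks = max_weight_independent_intervals(candidates)
print(total, blocks)
```

Because the vertices of an interval graph can be ordered by right endpoint, the problem reduces to a one-dimensional dynamic program that runs in O(n log n) time, whereas maximum-weight independent set is NP-hard on general graphs.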
Source code and documentation are available for download at http://github.com/sunnyeesl/BigLD.
Supplementary data are available at Bioinformatics online.