SNPrune: an efficient algorithm to prune large SNP array and sequence datasets based on high linkage disequilibrium.

Affiliation

Animal Breeding and Genomics, Wageningen University & Research, P.O. Box 338, 6700 AH, Wageningen, The Netherlands.

Publication information

Genet Sel Evol. 2018 Jun 26;50(1):34. doi: 10.1186/s12711-018-0404-z.

DOI:10.1186/s12711-018-0404-z

PMID:29940846

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6019535/

Abstract

BACKGROUND

High levels of pairwise linkage disequilibrium (LD) in single nucleotide polymorphism (SNP) array or whole-genome sequence data may affect both performance and efficiency of genomic prediction models. Thus, this warrants pruning of genotyping data for high LD. We developed an algorithm, named SNPrune, which enables the rapid detection of any pair of SNPs in complete or high LD throughout the genome.

METHODS

LD, measured as the squared correlation between phased alleles (r²), can only reach a value of 1 when both loci have the same minor allele count. Sorting loci by minor allele count and then comparing their alleles enables rapid detection of loci in complete LD. Detection of loci in high LD can be optimized by computing, for each possible value of the minor allele count at one locus, the range of minor allele counts at another locus that can yield LD values higher than a predefined threshold. This reduces the number of pairs of loci for which LD needs to be computed, instead of considering all pairwise combinations of loci. The implemented algorithm, SNPrune, handles bi-allelic loci using either phased alleles or allele counts as input. SNPrune was validated against PLINK on two datasets, using an r² threshold of 0.99. The first dataset contained 52k SNP genotypes on 3534 pigs and the second contained simulated whole-genome sequence data with 10.8 million SNPs on 2500 animals.
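The two ideas in this paragraph can be illustrated with a minimal Python sketch. It is not the SNPrune implementation: all function names are hypothetical, only phased 0/1 alleles are handled, and the candidate range relies on the standard bound that, for minor allele frequencies p ≤ q, the largest attainable r² is p(1 − q) / (q(1 − p)). Requiring that bound to be at least a threshold t gives, for n haplotypes and minor allele count c at the first locus, a largest admissible count of n·c / (t·n + c·(1 − t)) at the second locus.

```python
import math
from itertools import groupby


def canonical(alleles):
    """Recode a 0/1 allele vector so that 1 marks the minor allele.

    r^2 does not depend on allele labels; ties (frequency 0.5) are broken
    by forcing the first haplotype to carry a 0.
    """
    n, ones = len(alleles), sum(alleles)
    flip = 2 * ones > n or (2 * ones == n and alleles[0] == 1)
    return tuple(1 - a for a in alleles) if flip else tuple(alleles)


def r_squared(a, b):
    """Squared correlation between two polymorphic 0/1 allele vectors."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    d = pab - pa * pb
    return d * d / (pa * (1 - pa) * pb * (1 - pb))


def max_partner_count(mac, n, threshold):
    """Largest minor allele count (>= mac) that can still reach r^2 >= threshold."""
    bound = n * mac / (threshold * n + mac * (1 - threshold))
    # The small epsilon only guards against rounding; being slightly too
    # inclusive is harmless because r^2 is still computed for every candidate.
    return min(n // 2, math.floor(bound + 1e-9))


def _sorted_polymorphic(loci):
    """Recode loci, count minor alleles, and sort polymorphic loci by that count."""
    coded = [canonical(locus) for locus in loci]
    macs = [sum(c) for c in coded]
    order = [i for i in sorted(range(len(loci)), key=macs.__getitem__) if macs[i] > 0]
    return coded, macs, order


def complete_ld_pairs(loci):
    """Pairs (kept, duplicate) with r^2 = 1: same count and identical recoded alleles."""
    coded, macs, order = _sorted_polymorphic(loci)
    pairs = []
    for _, group in groupby(order, key=macs.__getitem__):
        first_seen = {}
        for i in group:
            kept = first_seen.setdefault(coded[i], i)
            if kept != i:
                pairs.append((kept, i))
    return pairs


def high_ld_pairs(loci, threshold=0.99):
    """Pairs with r^2 >= threshold, computing r^2 only inside the admissible count range."""
    coded, macs, order = _sorted_polymorphic(loci)
    n = len(loci[0])
    pairs = []
    for pos, i in enumerate(order):
        limit = max_partner_count(macs[i], n, threshold)
        for k in range(pos + 1, len(order)):
            j = order[k]
            if macs[j] > limit:
                break  # later loci have even larger counts
            if r_squared(coded[i], coded[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy data: 6 loci observed on 8 haplotypes (illustrative values only).
example = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 1, 1],  # complement of locus 0
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 0],  # complement of locus 3
    [0, 0, 0, 0, 0, 0, 0, 0],  # monomorphic, ignored
]
print(complete_ld_pairs(example))
print(high_ld_pairs(example, threshold=0.99))
```

Because loci are sorted by minor allele count, the inner loop can stop as soon as a count exceeds the admissible limit, which is what keeps the number of evaluated pairs far below all pairwise combinations. With a threshold close to 1 the admissible range is narrow, so only loci with nearly equal counts are ever compared.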

RESULTS

SNPrune removed a similar number of SNPs as PLINK from the pig data, but was almost 12 times faster. From the simulated sequence data with 10.8 million SNPs, SNPrune removed 6.4 and 1.4 million SNPs due to complete and high LD, respectively. Results were very similar regardless of whether phased alleles or allele counts were used. Using allele counts and multi-threading with 10 threads, SNPrune completed the analysis in 21 min. Using a sliding window of up to 500,000 SNPs, PLINK removed approximately 43,000 fewer SNPs (0.6%) from the sequence data, and SNPrune was 24 and 170 times faster than PLINK when using one and ten threads, respectively.
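The reported figures permit a rough consistency check. The arithmetic below is only a sketch based on the rounded numbers in this abstract, and it assumes that the SNPs removed for complete and for high LD are disjoint sets and that both speed-up factors refer to the same PLINK run.

```latex
\begin{align*}
\text{SNPs removed}  &\approx 6.4 + 1.4 = 7.8 \text{ million} \quad (7.8 / 10.8 \approx 72\%),\\
\text{SNPs retained} &\approx 10.8 - 7.8 = 3.0 \text{ million},\\
\text{speed-up of SNPrune from 1 to 10 threads} &\approx 170 / 24 \approx 7.1 .
\end{align*}
```

Under these assumptions, roughly seven in ten simulated sequence SNPs were redundant at the 0.99 threshold, and ten threads gave about a sevenfold gain over a single thread.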

CONCLUSIONS

The SNPrune algorithm developed here is able to remove SNPs in high LD throughout the genome very efficiently in large datasets.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5090/6019535/78bfb601011d/12711_2018_404_Fig1_HTML.jpg
