A novel approach to estimating heterozygosity from low-coverage genome sequence.

Affiliation

Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115.

Publication information

Genetics. 2013 Oct;195(2):553-61. doi: 10.1534/genetics.113.154500. Epub 2013 Aug 9.

DOI:10.1534/genetics.113.154500

PMID:23934885

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3781980/

Abstract

High-throughput shotgun sequence data make it possible in principle to accurately estimate population genetic parameters without confounding by SNP ascertainment bias. One such statistic of interest is the proportion of heterozygous sites within an individual's genome, which is informative about inbreeding and effective population size. However, in many cases, the available sequence data of an individual are limited to low coverage, preventing the confident calling of genotypes necessary to directly count the proportion of heterozygous sites. Here, we present a method for estimating an individual's genome-wide rate of heterozygosity from low-coverage sequence data, without an intermediate step that calls genotypes. Our method jointly learns the shared allele distribution between the individual and a panel of other individuals, together with the sequencing error distributions and the reference bias. We show our method works well, first, by its performance on simulated sequence data and, second, on real sequence data where we obtain estimates using low-coverage data consistent with those from higher coverage. We apply our method to obtain estimates of the rate of heterozygosity for 11 humans from diverse worldwide populations and through this analysis reveal the complex dependency of local sequencing coverage on the true underlying heterozygosity, which complicates the estimation of heterozygosity from sequence data. We show how we can use filters to correct for the confounding arising from sequencing depth. We find in practice that ratios of heterozygosity are more interpretable than absolute estimates and show that we obtain excellent conformity of ratios of heterozygosity with previous estimates from higher-coverage data.
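To illustrate why low coverage prevents directly counting heterozygous sites, the following minimal sketch computes the chance that the reads covering a truly heterozygous site show both alleles. It is not the method presented in the paper. It assumes an idealized setting in which each read samples either allele with probability 1/2, with no sequencing error and no reference bias; the function name `prob_both_alleles_seen` is hypothetical.

```python
def prob_both_alleles_seen(depth: int) -> float:
    """Chance that reads at a heterozygous site show both alleles.

    Assumption (not from the paper): each read independently samples
    either allele with probability 1/2, with no sequencing error and
    no reference bias.
    """
    if depth < 2:
        # A single read (or none) can never reveal two alleles.
        return 0.0
    # All reads show the same allele with probability 2 * (1/2)^depth.
    return 1.0 - 2.0 * 0.5 ** depth


if __name__ == "__main__":
    for depth in (1, 2, 4, 8, 30):
        print(depth, prob_both_alleles_seen(depth))
```

Under these assumptions, half of all heterozygous sites covered by exactly two reads look homozygous, so naive counting would underestimate heterozygosity even before sequencing errors are considered.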
