Inferring population mutation rate and sequencing error rate using the SNP frequency spectrum in a sample of DNA sequences.

Authors

Liu Xiaoming, Maxwell Taylor J, Boerwinkle Eric, Fu Yun-Xin

Affiliation

Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, TX, USA.

Publication information

Mol Biol Evol. 2009 Jul;26(7):1479-90. doi: 10.1093/molbev/msp059. Epub 2009 Mar 24.

DOI:10.1093/molbev/msp059

PMID:19318520

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2734145/

Abstract

One challenge in analyzing samples of DNA sequences is accounting for the nonnegligible number of polymorphisms produced by sequencing error when the error rate is high or the sample size is large. Specifically, these artificial sequence variations bias the observed single nucleotide polymorphism (SNP) frequency spectrum, which in turn may bias estimators of the population mutation rate θ = 4Nμ for diploids. In this paper, we propose a new approach based on the generalized least squares (GLS) method to estimate θ, given a SNP frequency spectrum in a random sample of DNA sequences from a population. With this approach, the error rate ε can be either known or unknown. In the latter case, ε can be estimated given an estimate of θ. Using coalescent simulation, we compared our estimators with other estimators of θ. The results showed that the GLS estimators are more efficient than other θ estimators with error, and that the estimate of ε is usable in practice when θ per bp is small. We demonstrate the application of the estimators with a 10-kb noncoding region sequence sampled from a human population and provide suggestions for choosing θ estimators with error.
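
To illustrate why error-induced variants matter, the following minimal sketch computes two classical θ estimators (Watterson's estimator and Tajima's π) from an unfolded SNP frequency spectrum and shows how extra singletons shift each of them. It is not the GLS method proposed in the paper. It assumes the standard neutral expectation E[ξ_i] = θ/i for the number of sites at which the derived allele appears i times, and it assumes, as a simplification, that sequencing errors appear only as additional singletons. All function names are hypothetical.

```python
from math import comb


def expected_sfs(theta, n):
    """Expected unfolded spectrum under the standard neutral model.

    Entry i-1 holds E[xi_i] = theta / i for i = 1..n-1, where n is the
    number of sampled sequences.
    """
    return [theta / i for i in range(1, n)]


def watterson_theta(sfs):
    """Watterson's estimator: number of segregating sites divided by a_n."""
    n = len(sfs) + 1
    a_n = sum(1.0 / i for i in range(1, n))
    return sum(sfs) / a_n


def tajima_pi(sfs):
    """Tajima's estimator: mean number of pairwise differences."""
    n = len(sfs) + 1
    pairs = comb(n, 2)
    return sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / pairs


def add_error_singletons(sfs, extra):
    """Assumed error model: errors add `extra` sites to the singleton class."""
    noisy = list(sfs)
    noisy[0] += extra
    return noisy


if __name__ == "__main__":
    clean = expected_sfs(theta=5.0, n=10)
    noisy = add_error_singletons(clean, extra=3.0)
    print("clean:", watterson_theta(clean), tajima_pi(clean))
    print("noisy:", watterson_theta(noisy), tajima_pi(noisy))
```

Under these assumptions both estimators recover θ from the error-free spectrum, while the extra singletons inflate Watterson's estimator more than Tajima's π, because π gives singletons little weight. An error-aware estimator such as the one described in the abstract would need to model that excess explicitly.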

Similar articles

Estimating population genetic parameters and comparing model goodness-of-fit using DNA sequences with error.

Genome Res. 2010 Jan;20(1):101-9. doi: 10.1101/gr.097543.109. Epub 2009 Dec 1.

Estimating effective population size or mutation rate using the frequencies of mutations of various classes in a sample of DNA sequences.

Genetics. 1994 Dec;138(4):1375-86. doi: 10.1093/genetics/138.4.1375.

Correcting estimators of theta and Tajima's D for ascertainment biases caused by the single-nucleotide polymorphism discovery process.

Genetics. 2009 Feb;181(2):701-10. doi: 10.1534/genetics.108.094060. Epub 2008 Dec 15.

Accurate and fast methods to estimate the population mutation rate from error prone sequences.

BMC Bioinformatics. 2009 Aug 11;10:247. doi: 10.1186/1471-2105-10-247.

Simple method for analyzing the pattern of DNA polymorphism and its application to SNP data of human.

Genes Genet Syst. 2008 Aug;83(4):353-60. doi: 10.1266/ggs.83.353.

Robustness of coalescent estimators to between-lineage mutation rate variation.

Mol Biol Evol. 2006 Dec;23(12):2355-60. doi: 10.1093/molbev/msl106. Epub 2006 Sep 6.

A coalescent-based estimator of admixture from DNA sequences.

Genetics. 2006 Jul;173(3):1679-92. doi: 10.1534/genetics.105.054130. Epub 2006 Apr 19.

The number of alleles at a microsatellite defines the allele frequency spectrum and facilitates fast accurate estimation of theta.

Mol Biol Evol. 2010 Dec;27(12):2702-15. doi: 10.1093/molbev/msq164. Epub 2010 Jul 6.

Usefulness of single nucleotide polymorphism data for estimating population parameters.

Genetics. 2000 Sep;156(1):439-47. doi: 10.1093/genetics/156.1.439.

Cited by

Characterizing bias in population genetic inferences from low-coverage sequencing data.

Mol Biol Evol. 2014 Mar;31(3):723-35. doi: 10.1093/molbev/mst229. Epub 2013 Nov 27.

Estimating DNA polymorphism from next generation sequencing data with high error rate by dual sequencing applications.

BMC Genomics. 2013 Aug 7;14:535. doi: 10.1186/1471-2164-14-535.

jPopGen Suite: population genetic analysis of DNA polymorphism from nucleotide sequences with errors.

Methods Ecol Evol. 2012 Aug 1;3(4):624-627. doi: 10.1111/j.2041-210X.2012.00194.x. Epub 2012 Mar 2.

Atypical AT skew in Firmicute genomes results from selection and not from mutation.

PLoS Genet. 2011 Sep;7(9):e1002283. doi: 10.1371/journal.pgen.1002283. Epub 2011 Sep 15.

Estimating population genetic parameters and comparing model goodness-of-fit using DNA sequences with error.

Genome Res. 2010 Jan;20(1):101-9. doi: 10.1101/gr.097543.109. Epub 2009 Dec 1.

Measuring the rates of spontaneous mutation from deep and large-scale polymorphism data.

Genetics. 2009 Aug;182(4):1219-32. doi: 10.1534/genetics.109.105692. Epub 2009 Jun 15.

Site frequency spectra from genomic SNP surveys.

Theor Popul Biol. 2009 Jun;75(4):346-54. doi: 10.1016/j.tpb.2009.04.003. Epub 2009 Apr 14.
