Inference of site frequency spectra from high-throughput sequence data: quantification of selection on nonsynonymous and synonymous sites in humans.

Affiliation

Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JT, UK.

Publication information

Genetics. 2011 Aug;188(4):931-40. doi: 10.1534/genetics.111.128355. Epub 2011 May 19.

Abstract

Sequencing errors and random sampling of nucleotide types among sequencing reads at heterozygous sites present challenges for accurate, unbiased inference of single-nucleotide polymorphism genotypes from high-throughput sequence data. Here, we develop a maximum-likelihood approach to estimate the frequency distribution of the number of alleles in a sample of individuals (the site frequency spectrum), using high-throughput sequence data. Our method assumes binomial sampling of nucleotide types in heterozygotes and random sequencing error. By simulations, we show that close to unbiased estimates of the site frequency spectrum can be obtained if the error rate per base read does not exceed the population nucleotide diversity. We also show that these estimates are reasonably robust if errors are nonrandom. We then apply the method to infer site frequency spectra for zerofold degenerate, fourfold degenerate, and intronic sites of protein-coding genes using the low coverage human sequence data produced by the 1000 Genomes Project phase-one pilot. By fitting a model to the inferred site frequency spectra that estimates parameters of the distribution of fitness effects of new mutations, we find evidence for significant natural selection operating on fourfold sites. We also find that a model with variable effects of mutations at synonymous sites fits the data significantly better than a model with equal mutational effects. Under the variable effects model, we infer that 11% of synonymous mutations are subject to strong purifying selection.
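The likelihood idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes biallelic sites, a single symmetric per-read error rate `err` greater than zero, Hardy-Weinberg proportions within the sample, and a simple expectation-maximization update. The function names (`genotype_likelihoods`, `site_allele_count_likelihoods`, `estimate_sfs`) and the input layout (one `(n_alt, n_reads)` pair per individual per site) are hypothetical choices made for illustration only.

```python
from math import comb


def genotype_likelihoods(n_alt, n_reads, err):
    """P(read counts | genotype) for 0, 1, 2 copies of the alternate allele.

    Assumption: reads are binomially sampled; a homozygote shows the
    other allele only through error, a heterozygote shows it half the time.
    """
    liks = []
    for p_alt in (err, 0.5, 1.0 - err):
        liks.append(
            comb(n_reads, n_alt) * p_alt**n_alt * (1.0 - p_alt) ** (n_reads - n_alt)
        )
    return liks


def site_allele_count_likelihoods(site, err):
    """P(reads at one site | sample carries j alternate alleles), j = 0..2N.

    `site` is a list of (n_alt, n_reads) pairs, one per diploid individual.
    Individuals are combined one at a time by dynamic programming.
    """
    h = [1.0]
    for n_alt, n_reads in site:
        g0, g1, g2 = genotype_likelihoods(n_alt, n_reads, err)
        new = [0.0] * (len(h) + 2)
        for j, hj in enumerate(h):
            new[j] += hj * g0
            new[j + 1] += hj * 2.0 * g1  # two ways to be heterozygous
            new[j + 2] += hj * g2
        h = new
    two_n = len(h) - 1
    # Divide by the number of ways to place j alleles on 2N chromosomes.
    return [h[j] / comb(two_n, j) for j in range(two_n + 1)]


def estimate_sfs(sites, err, n_iter=200):
    """EM estimate of the proportion of sites in each allele-count class."""
    liks = [site_allele_count_likelihoods(s, err) for s in sites]
    k = len(liks[0])
    sfs = [1.0 / k] * k  # start from a uniform spectrum
    for _ in range(n_iter):
        post_sum = [0.0] * k
        for lik in liks:
            weighted = [f * l for f, l in zip(sfs, lik)]
            total = sum(weighted)
            for j in range(k):
                post_sum[j] += weighted[j] / total
        sfs = [p / len(liks) for p in post_sum]
    return sfs


if __name__ == "__main__":
    # Toy data: two individuals, 20 reads each, ten sites.
    monomorphic = [(0, 20), (0, 20)]
    singleton = [(10, 20), (0, 20)]
    toy_sites = [monomorphic] * 8 + [singleton] * 2
    print(estimate_sfs(toy_sites, err=0.01))
```

With high coverage, as in the toy data, the read counts identify each genotype almost unambiguously. At low coverage the per-site likelihoods are spread across several allele counts, which is why estimating the spectrum directly from the likelihoods, rather than from called genotypes, is attractive.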
