

SNP calling by sequencing pooled samples.

Affiliation

Centro Nacional de Análisis Genómico, Parc Científic de Barcelona, Barcelona, 08028, Spain.

Publication information

BMC Bioinformatics. 2012 Sep 20;13:239. doi: 10.1186/1471-2105-13-239.

Abstract

BACKGROUND

Performing high-throughput sequencing on samples pooled from different individuals is a strategy to characterize genetic variability at a small fraction of the cost required for individual sequencing. In certain circumstances some variability estimators have even lower variance than those obtained with individual sequencing. SNP calling and estimating the frequency of the minor allele from pooled samples, though, is a subtle exercise for at least three reasons. First, sequencing errors may matter much more than in individual SNP calling: while their impact in individual sequencing can be reduced by requiring a minimum number of reads per allele, this would have a strong and undesired effect in pools because alleles at low frequency in the pool are unlikely to be read many times. Second, the prior allele frequency for heterozygous sites in individuals is usually 0.5 (assuming one is not analyzing sequences coming from, e.g., cancer tissues), but this is not true in pools: in fact, under the standard neutral model, singletons (i.e. alleles of minimum frequency) are the most common class of variants because P(f) ∝ 1/f, and they occur more often as the sample size increases. Third, an allele appearing only once in the reads from a pool does not necessarily correspond to a singleton in the set of individuals making up the pool; conversely, a true singleton can yield more than one read or, more likely, none.
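The second and third points can be illustrated numerically. The following is a minimal sketch, not taken from the paper: it assumes a prior proportional to 1/k over allele counts k = 1..N-1, error-free reads, and binomial sampling of reads from the pool. The function names and the coverage value (`depth = 20`) are hypothetical choices made for illustration.

```python
from math import comb


def neutral_prior(n_chrom):
    """Prior over allele counts k = 1..n_chrom-1, with P(k) proportional to 1/k."""
    weights = {k: 1.0 / k for k in range(1, n_chrom)}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def read_count_prob(k, n_chrom, depth, reads):
    """Binomial probability of observing `reads` copies of an allele carried by
    k of n_chrom pooled chromosomes at coverage `depth` (errors ignored)."""
    p = k / n_chrom
    return comb(depth, reads) * p**reads * (1.0 - p) ** (depth - reads)


n_chrom = 50   # pooled chromosomes, as in the N = 50 setting of the abstract
depth = 20     # assumed coverage, chosen only for illustration

prior = neutral_prior(n_chrom)

# Reads obtained from a true singleton (k = 1): none, exactly one, or several.
p_zero = read_count_prob(1, n_chrom, depth, 0)
p_one = read_count_prob(1, n_chrom, depth, 1)
p_many = 1.0 - p_zero - p_one

print(f"prior mass on singletons: {prior[1]:.3f}")
print(f"singleton read 0 times: {p_zero:.3f}, once: {p_one:.3f}, more: {p_many:.3f}")
```

Under these assumptions the singleton class carries the largest prior mass, and a true singleton is more likely to be missed entirely than to be read several times.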

RESULTS

To improve upon existing theory and software packages, we have developed a Bayesian approach for minor allele frequency (MAF) computation and SNP calling in pools (and implemented it in a program called snape): the approach takes into account sequencing errors and allows users to choose different priors. We also set up a pipeline which can simulate the coalescence process giving rise to the SNPs, the pooling procedure and the sequencing. We used it to compare the performance of snape to that of other packages.
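A Bayesian calculation of this general kind can be sketched as follows. This is not snape's implementation, whose details are not given in the abstract: it is a simplified illustration that assumes a single alternative allele, a uniform per-base error rate, binomial read sampling, and a prior that places mass `p_mono` on a monomorphic site and spreads the rest proportionally to 1/k. All names and default values are hypothetical.

```python
from math import comb


def posterior_allele_count(alt_reads, depth, n_chrom, error=0.01, p_mono=0.99):
    """Posterior over the number k of pooled chromosomes carrying the alternative
    allele, given `alt_reads` alternative reads out of `depth` (illustrative model)."""
    # Prior: mass p_mono on k = 0, remainder proportional to 1/k for k = 1..n_chrom-1.
    norm = sum(1.0 / k for k in range(1, n_chrom))
    prior = {0: p_mono}
    for k in range(1, n_chrom):
        prior[k] = (1.0 - p_mono) * (1.0 / k) / norm

    unnormalized = {}
    for k, p_k in prior.items():
        f = k / n_chrom
        # A read shows the alternative base either correctly (true carrier)
        # or through an error to that specific base (one of three others).
        p_alt = f * (1.0 - error) + (1.0 - f) * error / 3.0
        likelihood = (
            comb(depth, alt_reads)
            * p_alt**alt_reads
            * (1.0 - p_alt) ** (depth - alt_reads)
        )
        unnormalized[k] = p_k * likelihood

    total = sum(unnormalized.values())
    return {k: v / total for k, v in unnormalized.items()}


post = posterior_allele_count(alt_reads=3, depth=40, n_chrom=50)
p_segregating = 1.0 - post[0]                       # posterior that the site is a SNP
map_count = max(post, key=post.get)                 # most probable allele count
print(f"P(segregating) = {p_segregating:.3f}, most probable count = {map_count}")
```

Swapping the `prior` dictionary for another distribution is how a different prior would be tried in this sketch; the likelihood and normalization steps stay the same.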

CONCLUSIONS

We present software for calling SNPs in pooled samples: it has good power while retaining a low false discovery rate (FDR). The method also provides the posterior probability that a SNP is segregating and the full posterior distribution of f for every SNP. To test the behaviour of our software, we generated artificial genomes through simulated coalescence and computed the effect of a pooled sequencing protocol, followed by SNP calling. In this setting, snape has better power and FDR than the comparable packages samtools, PoPoolation, and Varscan: for N = 50 chromosomes, snape has power ≈ 35% and FDR ≈ 2.5%. snape is available at http://code.google.com/p/snape-pooled/ (source code and precompiled binaries).
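For readers unfamiliar with the two metrics, power and FDR can be computed from simulation counts as in the sketch below. It uses the standard definitions (power as the fraction of true SNPs that are called, FDR as the fraction of calls that are not true SNPs); the function name and the counts are made up for illustration and are not results from the paper.

```python
def power_and_fdr(true_positives, false_positives, false_negatives):
    """Return (power, FDR) from counts of called and missed SNPs."""
    real_snps = true_positives + false_negatives
    called_snps = true_positives + false_positives
    power = true_positives / real_snps if real_snps else 0.0
    fdr = false_positives / called_snps if called_snps else 0.0
    return power, fdr


# Made-up counts: 80 true SNPs called, 20 spurious calls, 120 true SNPs missed.
power, fdr = power_and_fdr(true_positives=80, false_positives=20, false_negatives=120)
print(f"power = {power:.2f}, FDR = {fdr:.2f}")
```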


