The Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA.
Genome Res. 2010 Feb;20(2):273-80. doi: 10.1101/gr.096388.109. Epub 2009 Dec 17.
Accurate identification of genetic variants from next-generation sequencing (NGS) data is essential for immediate large-scale genomic endeavors such as the 1000 Genomes Project, and for further genetic analysis that builds on those discoveries. The key challenge in single nucleotide polymorphism (SNP) discovery is to distinguish true individual variants, which occur at a low frequency, from sequencing errors, which often occur at frequencies orders of magnitude higher. Knowledge of the error probabilities of base calls is therefore essential. We have developed Atlas-SNP2, a computational tool that detects and accounts for systematic sequencing errors caused by context-related variables, using a logistic regression model learned from training data sets. It then estimates the posterior error probability for each substitution through a Bayesian formula that integrates prior knowledge of the overall sequencing error probability and the estimated SNP rate with the logistic regression results for the given substitutions. The estimated posterior SNP probability can be used to distinguish true SNPs from sequencing errors. Validation results show that Atlas-SNP2 achieves a false-positive rate below 10%, with a false-negative rate of approximately 5% or lower.
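The two-stage idea described above (a logistic regression score for each substitution, followed by a Bayesian update with prior error and SNP rates) can be illustrated with a minimal sketch. This is not the Atlas-SNP2 formula, which the abstract does not state. The function names, the feature vector, the weights, the `train_error_fraction` parameter, and all numeric values are assumptions chosen for illustration only. The sketch treats a single observed substitution as either a sequencing error or a true SNP, converts the classifier output into a likelihood ratio by dividing out the training-set class odds, and applies Bayes' rule with the supplied priors.

```python
import math


def logistic_error_probability(features, weights, bias):
    """Hypothetical logistic model: P(error | context features) for one substitution."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def posterior_snp_probability(p_error_model, prior_error, prior_snp,
                              train_error_fraction=0.5):
    """Two-hypothesis Bayes update (illustrative, not the published formula).

    p_error_model        -- classifier output P(error | features)
    prior_error          -- assumed overall sequencing error probability
    prior_snp            -- assumed SNP rate
    train_error_fraction -- assumed share of errors in the training set
    """
    # Clamp to avoid division by zero at exactly 0 or 1.
    eps = 1e-12
    p = min(max(p_error_model, eps), 1.0 - eps)

    # Dividing the model odds by the training-set odds gives the likelihood
    # ratio P(features | error) / P(features | SNP).
    model_odds = p / (1.0 - p)
    train_odds = train_error_fraction / (1.0 - train_error_fraction)
    lr_error = model_odds / train_odds

    # Posterior that the substitution is a SNP rather than an error.
    return prior_snp / (prior_snp + prior_error * lr_error)


if __name__ == "__main__":
    # Hypothetical features, e.g. a scaled base quality and a read-position term.
    features = [0.8, -1.2]
    weights = [-2.0, 0.5]   # illustrative values, not fitted coefficients
    p_err = logistic_error_probability(features, weights, bias=0.3)
    print(posterior_snp_probability(p_err, prior_error=0.01, prior_snp=0.001))
```

Because the assumed prior error probability is much larger than the assumed SNP rate, a substitution needs a low model error score before its posterior SNP probability becomes large, which mirrors the challenge the abstract describes. Thresholding this posterior is one way such a score could be used to separate candidate SNPs from errors.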