Analysis of context-dependent errors for Illumina sequencing.

Author information

Abnizova Irina, Leonard Steven, Skelly Tom, Brown Andy, Jackson David, Gourtovaia Marina, Qi Guoying, Te Boekhorst Rene, Faruque Nadeem, Lewis Kevin, Cox Tony

Affiliation

Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.

Publication information

J Bioinform Comput Biol. 2012 Apr;10(2):1241005. doi: 10.1142/S0219720012410053.

DOI:10.1142/S0219720012410053

PMID:22809341

Abstract

The new generation of short-read sequencing technologies requires reliable measures of data quality. Such measures are especially important for variant calling. However, in the particular case of SNP calling, a great number of false-positive SNPs (FP-SNPs) may be obtained. One needs to distinguish putative SNPs from sequencing or other errors. We found that distinguishing an FP-SNP depends not only on the probability of a sequencing error (i.e. the quality value) but also on the conditional probability of "correcting" this error (the "second best call" probability, conditional on that of the first call). Surprisingly, around 80% of mismatches can be "corrected" with this second call. Another way to reduce the rate of FP-SNPs is to retrieve DNA motifs that seem to be prone to sequencing errors, and to attach a corresponding conditional quality value to these motifs. We have developed several measures to distinguish between sequencing errors and candidate SNPs, based on a base call's nucleotide context and its mismatch type. In addition, we suggest a simple method to correct the majority of mismatches, based on the conditional probability of their "second best" intensity call. To each mismatch we attach a corresponding second-call confidence (quality value) of being corrected.
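The abstract does not give a formula for the "second best call" probability, so the following is only a minimal sketch of one possible reading, not the authors' method. It assumes that the four per-base intensities of a cycle can be normalised into probabilities, and it takes the conditional probability of the second call as its share of the probability mass left over once the first call is assumed wrong. The function names and the input format are hypothetical; the only standard ingredient is the Phred relation Q = -10 log10(p).

```python
import math


def phred_to_error_prob(q):
    """Standard Phred relation: Q = -10 * log10(p_error)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p, q_max=60.0):
    """Inverse Phred relation, capped at q_max (the cap is an assumption)."""
    if p <= 0:
        return q_max
    return min(q_max, -10 * math.log10(p))


def second_call_quality(intensities):
    """Rank the bases of one cycle and score the second-best call.

    `intensities` is a hypothetical dict mapping each base to a
    non-negative intensity. Normalising intensities into probabilities
    is an illustrative assumption, not something stated in the abstract.
    Returns (first_base, second_base, p_second_given_first_wrong, q_second).
    """
    total = sum(intensities.values())
    if total <= 0:
        raise ValueError("intensities must sum to a positive value")
    probs = {base: value / total for base, value in intensities.items()}
    ranked = sorted(probs, key=probs.get, reverse=True)
    first, second = ranked[0], ranked[1]
    remaining = 1.0 - probs[first]
    if remaining <= 0:
        # All signal is in one channel: there is no second call to score.
        return first, second, 0.0, 0.0
    # P(second call is right | first call is wrong)
    p_cond = probs[second] / remaining
    # Phred-scaled confidence that the second call corrects the mismatch
    q_second = error_prob_to_phred(1.0 - p_cond)
    return first, second, p_cond, q_second
```

For example, with hypothetical intensities `{"A": 50, "C": 40, "G": 5, "T": 5}` the first call is A, the second call is C, and the conditional probability of C given that A is wrong is 0.4 / 0.5 = 0.8 under these assumptions.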
