Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA 02215, USA.
BMC Bioinformatics. 2012 Jul 30;13:185. doi: 10.1186/1471-2105-13-185.
Short-read data from next-generation sequencing technologies are now being generated across a range of research projects. The fidelity of these data can be affected by several factors, so it is important to have simple and reliable approaches for monitoring it at the level of individual experiments.
We developed a fast, scalable and accurate approach to estimating error rates in short reads, which has the added advantage of not requiring a reference genome. We build on the fundamental observation that there is a linear relationship between the copy number of a given read and the number of erroneous reads that differ from the read of interest by one or two bases. The slope of this relationship can be transformed to give an estimate of the error rate, both by read and by position. We present simulation studies as well as analyses of real data sets illustrating the precision and accuracy of this method, and we show that it is more accurate than alternatives that count the differences between the sample of interest and a reference genome. We show how this methodology led to the detection of mutations in the genome of the PhiX strain used for calibration of Illumina data. The proposed method is implemented in an R package, which can be downloaded from http://bcb.dfci.harvard.edu/~vwang/shadowRegression.html.
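The idea of regressing erroneous-read counts on copy number can be illustrated with a minimal Python sketch. It is not the R package's implementation, and everything in it is an assumption made for illustration: the function names, the fit of a least-squares line through the origin, the use of only the most abundant sequences as "reads of interest", and the transformation itself. The transformation assumes a simple model in which a fraction e of the reads from a true sequence carry errors, so that the number of erroneous reads is roughly e/(1 - e) times the error-free copy number, giving e = slope / (1 + slope). The per-base figure additionally assumes independent, equally likely errors at every position.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def shadow_counts(reads, top_k=2, max_mismatch=2):
    """For the top_k most abundant sequences, pair each copy number with the
    number of reads that differ from that sequence by 1..max_mismatch bases."""
    counts = Counter(reads)
    pairs = []
    for seq, copies in counts.most_common(top_k):
        shadows = sum(
            n
            for other, n in counts.items()
            if len(other) == len(seq) and 1 <= hamming(seq, other) <= max_mismatch
        )
        pairs.append((copies, shadows))
    return pairs


def slope_through_origin(pairs):
    """Least-squares slope of shadows on copies, with no intercept (assumed)."""
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return num / den


def error_rates(slope, read_length):
    """Transform the slope into per-read and per-base error rates under the
    simple model stated above (an assumption, not the published formula)."""
    per_read = slope / (1.0 + slope)
    per_base = 1.0 - (1.0 - per_read) ** (1.0 / read_length)
    return per_read, per_base


# Toy data: two "true" sequences, each with 10% of its reads carrying errors.
reads = (
    ["ACGTACGT"] * 90
    + ["ACGTACGA"] * 4 + ["TCGTACGT"] * 3 + ["ACCTACGT"] * 3
    + ["TTGGCCAA"] * 45
    + ["TTGGCCAT"] * 3 + ["ATGGCCAA"] * 2
)
pairs = shadow_counts(reads)
slope = slope_through_origin(pairs)
per_read, per_base = error_rates(slope, read_length=8)
print(pairs, slope, per_read, per_base)
```

In this toy input the erroneous reads make up one tenth of the reads from each sequence, so the fitted slope recovers a per-read error rate of about 0.1. The all-pairs Hamming comparison is quadratic in the number of distinct sequences and is only suitable for small examples.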
The proposed method can be used to monitor the quality of sequencing pipelines at the level of individual experiments without the use of reference genomes. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data.