Empirical estimation of sequencing error rates using smoothing splines.

Authors

Zhu Xuan, Wang Jian, Peng Bo, Shete Sanjay

Affiliations

Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA.

Department of Bioinformatics & Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA.

Publication information

BMC Bioinformatics. 2016 Apr 22;17:177. doi: 10.1186/s12859-016-1052-3.

DOI:10.1186/s12859-016-1052-3

PMID:27102907

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4840868/

Abstract

BACKGROUND

Investigators have used next-generation sequencing to address a diverse range of biological problems through, for example, polymorphism and mutation discovery and microRNA profiling. However, error rates for next-generation sequencing are often higher than those for conventional sequencing, which affects downstream genomic analysis. Recently, Wang et al. (BMC Bioinformatics 13:185, 2012) proposed a shadow regression approach that estimates error rates for next-generation sequencing data by assuming a linear relationship between the number of reads sequenced and the number of reads containing errors (denoted as shadows). However, this linear read-shadow relationship may not be appropriate for all types of sequence data, so error rates need to be estimated in a more reliable way that does not assume linearity. We propose an empirical error rate estimation approach that employs cubic and robust smoothing splines to model the relationship between the number of reads sequenced and the number of shadows.
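
For orientation, the two modelling choices contrasted above can be written in standard notation. This is only a sketch: the symbols r_i (read count for sequence i), s_i (shadow count), \beta, f and \lambda are assumptions introduced here, and the abstract does not give the paper's exact parameterization. The second expression is the textbook penalized least-squares definition of a cubic smoothing spline; robust variants typically replace the squared loss with one that is less sensitive to outliers.

```latex
% Linear shadow regression in its simplest assumed form:
% shadow counts proportional to read counts
s_i = \beta\, r_i + \varepsilon_i

% Cubic smoothing spline: no linearity assumption on f,
% with \lambda \ge 0 controlling the roughness penalty
\hat{f} = \arg\min_{f} \sum_{i=1}^{n} \bigl(s_i - f(r_i)\bigr)^2
          + \lambda \int f''(t)^2 \, dt
```

As \lambda grows, the penalty forces f toward a straight line, so the linear fit can be viewed as a limiting case of the spline fit.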

RESULTS

We performed simulation studies using a frequency-based approach that generates the read and shadow counts directly and can mimic the structure of real sequence count data. Using simulation, we investigated the performance of the proposed approach and compared it to that of shadow linear regression. The proposed approach provided more accurate error rate estimates than the shadow linear regression approach in all scenarios tested. We also applied the proposed approach to assess the error rates for sequence data from the MicroArray Quality Control project, a mutation screening study, the Encyclopedia of DNA Elements project, and bacteriophage PhiX DNA samples.
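
To make the idea of simulating read and shadow counts directly more concrete, here is a minimal Python sketch. It is not the paper's simulation design, which the abstract does not describe in detail. It assumes a toy model in which every read copy independently acquires an error with a fixed probability, draws sequence abundances from a log-normal distribution chosen only for illustration, and then recovers the error rate with a through-origin linear fit (the linear baseline, not the spline method). All function names and parameter values are hypothetical.

```python
import random


def simulate_counts(n_sequences, error_rate, rng):
    """Toy generator returning (error_free_count, shadow_count) pairs."""
    pairs = []
    for _ in range(n_sequences):
        # Assumed abundance model: skewed, many rare and a few abundant sequences.
        total = int(rng.lognormvariate(5.5, 1.0))
        # Assumption: each read copy independently contains an error.
        shadows = sum(1 for _ in range(total) if rng.random() < error_rate)
        pairs.append((total - shadows, shadows))
    return pairs


def slope_through_origin(pairs):
    """Least-squares slope of shadow counts on error-free counts, no intercept."""
    sxy = sum(x * y for x, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    return sxy / sxx


def error_rate_from_slope(slope):
    """Under the toy model the slope approximates p / (1 - p), so p = slope / (1 + slope)."""
    return slope / (1.0 + slope)


rng = random.Random(0)  # fixed seed so the run is reproducible
pairs = simulate_counts(1000, 0.02, rng)
estimate = error_rate_from_slope(slope_through_origin(pairs))
print(f"true rate 0.02, estimated {estimate:.4f}")
```

Because the simulated relationship here is linear by construction, the linear fit does well; the scenarios of interest in the study are those where that assumption fails and a smoother, non-linear fit is needed.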

CONCLUSIONS

The proposed empirical error rate estimation approach does not assume a linear relationship between the error-free read and shadow counts and provides more accurate estimates of error rates for next-generation, short-read sequencing data.
