Empirical estimation of sequencing error rates using smoothing splines.

Authors

Zhu Xuan, Wang Jian, Peng Bo, Shete Sanjay

Affiliations

Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA.

Department of Bioinformatics & Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA.

Publication information

BMC Bioinformatics. 2016 Apr 22;17:177. doi: 10.1186/s12859-016-1052-3.

Abstract

BACKGROUND

Next-generation sequencing has been used by investigators to address a diverse range of biological problems through, for example, polymorphism and mutation discovery and microRNA profiling. However, compared to conventional sequencing, the error rates for next-generation sequencing are often higher, which impacts the downstream genomic analysis. Recently, Wang et al. (BMC Bioinformatics 13:185, 2012) proposed a shadow regression approach to estimate the error rates for next-generation sequencing data based on the assumption of a linear relationship between the number of reads sequenced and the number of reads containing errors (denoted as shadows). However, this linear read-shadow relationship may not be appropriate for all types of sequence data. Therefore, it is necessary to estimate the error rates in a more reliable way without assuming linearity. We proposed an empirical error rate estimation approach that employs cubic and robust smoothing splines to model the relationship between the number of reads sequenced and the number of shadows.
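The contrast between the linear read-shadow fit and a cubic smoothing spline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data are synthetic, the function names are hypothetical, SciPy's `UnivariateSpline` stands in for whatever spline software the paper used, and the conversion from fitted shadows to an error rate (shadows divided by reads plus shadows) is an assumption made only for this example. The robust smoothing spline variant is not shown.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline


def fit_linear_through_origin(reads, shadows):
    """Shadow-regression-style fit: shadows ~ slope * reads."""
    return float(np.dot(reads, shadows) / np.dot(reads, reads))


def fit_cubic_smoothing_spline(reads, shadows, smoothing):
    """Cubic smoothing spline of shadow counts on read counts."""
    order = np.argsort(reads)  # UnivariateSpline expects increasing x
    return UnivariateSpline(reads[order], shadows[order], k=3, s=smoothing)


def implied_error_rate(reads, predicted_shadows):
    """Assumption for illustration: rate = shadows / (reads + shadows)."""
    total_shadows = float(np.sum(predicted_shadows))
    return total_shadows / (float(np.sum(reads)) + total_shadows)


# Synthetic data (not from the paper): a mildly curved read-shadow relationship.
# Values are continuous stand-ins for counts to keep the example short.
rng = np.random.default_rng(0)
reads = np.linspace(500.0, 10_000.0, 60)
noise_sd = 2.0
shadows = 0.02 * reads + 1e-6 * reads**2 + rng.normal(0.0, noise_sd, reads.size)

slope = fit_linear_through_origin(reads, shadows)
# Smoothing target set to (number of points) * (noise variance).
spline = fit_cubic_smoothing_spline(reads, shadows, reads.size * noise_sd**2)

# Residual sums of squares show how well each model tracks the curvature.
rss_linear = float(np.sum((shadows - slope * reads) ** 2))
rss_spline = float(np.sum((shadows - spline(reads)) ** 2))

rate_linear = implied_error_rate(reads, slope * reads)
rate_spline = implied_error_rate(reads, spline(reads))

print(f"linear: RSS={rss_linear:.0f}, implied rate={rate_linear:.4f}")
print(f"spline: RSS={rss_spline:.0f}, implied rate={rate_spline:.4f}")
```

When the true relationship is curved, the straight line leaves systematic residuals that the spline absorbs, which is the motivation for dropping the linearity assumption.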

RESULTS

We performed simulation studies using a frequency-based approach that generates the read and shadow counts directly and mimics the structure of real sequence count data. Using these simulations, we investigated the performance of the proposed approach and compared it to that of shadow linear regression. The proposed approach provided more accurate error rate estimates than the shadow linear regression approach in all the scenarios tested. We also applied the proposed approach to assess the error rates for the sequence data from the MicroArray Quality Control project, a mutation screening study, the Encyclopedia of DNA Elements project, and bacteriophage PhiX DNA samples.
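Generating read and shadow counts directly, rather than simulating individual reads, might look like the sketch below. The abstract does not describe the simulation design, so everything here is a hypothetical stand-in: the negative binomial abundance distribution, the binomial error model, the parameter values, and the simplification that every erroneous read is counted as a shadow of its source sequence.

```python
import numpy as np

rng = np.random.default_rng(42)

true_error_rate = 0.01   # assumed per-read error rate for the simulation
n_sequences = 2000       # assumed number of distinct source sequences

# Total copies sequenced per source sequence, drawn from a skewed
# frequency distribution (hypothetical choice; +1 avoids zero counts).
total_copies = rng.negative_binomial(2, 0.01, size=n_sequences) + 1

# Each copy independently picks up an error with the assumed rate;
# erroneous copies become shadows, the rest are error-free reads.
shadow_counts = rng.binomial(total_copies, true_error_rate)
read_counts = total_copies - shadow_counts

# Naive pooled estimate, used only to check the simulated counts.
pooled_estimate = shadow_counts.sum() / total_copies.sum()

print(f"true rate={true_error_rate:.4f}, pooled estimate={pooled_estimate:.4f}")
```

Because the true rate is known by construction, counts generated this way give a reference against which a linear or spline-based estimator can be scored.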

CONCLUSIONS

The proposed empirical error rate estimation approach does not assume a linear relationship between the error-free read and shadow counts and provides more accurate estimations of error rates for next-generation, short-read sequencing data.

Similar articles

1
Empirical estimation of sequencing error rates using smoothing splines.
BMC Bioinformatics. 2016 Apr 22;17:177. doi: 10.1186/s12859-016-1052-3.
2
Estimation of sequencing error rates in short reads.
BMC Bioinformatics. 2012 Jul 30;13:185. doi: 10.1186/1471-2105-13-185.
3
PhredEM: a phred-score-informed genotype-calling approach for next-generation sequencing studies.
Genet Epidemiol. 2017 Jul;41(5):375-387. doi: 10.1002/gepi.22048. Epub 2017 May 31.
5
NGmerge: merging paired-end reads via novel empirically-derived models of sequencing errors.
BMC Bioinformatics. 2018 Dec 20;19(1):536. doi: 10.1186/s12859-018-2579-2.
7
A Sequence-Based Novel Approach for Quality Evaluation of Third-Generation Sequencing Reads.
Genes (Basel). 2019 Jan 14;10(1):44. doi: 10.3390/genes10010044.
8
In search of perfect reads.
BMC Bioinformatics. 2015;16 Suppl 17(Suppl 17):S7. doi: 10.1186/1471-2105-16-S17-S7. Epub 2015 Dec 7.
9
Genome assembly using Nanopore-guided long and error-free DNA reads.
BMC Genomics. 2015 Apr 20;16(1):327. doi: 10.1186/s12864-015-1519-z.

Cited by

1
coiaf: Directly estimating complexity of infection with allele frequencies.
PLoS Comput Biol. 2023 Jun 9;19(6):e1010247. doi: 10.1371/journal.pcbi.1010247. eCollection 2023 Jun.
2
Genome-wide functional analysis using the barcode sequence alignment and statistical analysis (Barcas) tool.
BMC Bioinformatics. 2016 Dec 23;17(Suppl 17):475. doi: 10.1186/s12859-016-1326-9.

References

1
Application of next-generation sequencing technology in forensic science.
Genomics Proteomics Bioinformatics. 2014 Oct;12(5):190-7. doi: 10.1016/j.gpb.2014.09.001. Epub 2014 Oct 14.
2
Sequencing pools of individuals - mining genome-wide polymorphism data without big funding.
Nat Rev Genet. 2014 Nov;15(11):749-63. doi: 10.1038/nrg3803. Epub 2014 Sep 23.
3
Ten years of next-generation sequencing technology.
Trends Genet. 2014 Sep;30(9):418-26. doi: 10.1016/j.tig.2014.07.001. Epub 2014 Aug 6.
4
Quality control of next-generation sequencing data without a reference.
Front Genet. 2014 May 6;5:111. doi: 10.3389/fgene.2014.00111. eCollection 2014.
5
BLESS: bloom filter-based error correction solution for high-throughput sequencing reads.
Bioinformatics. 2014 May 15;30(10):1354-62. doi: 10.1093/bioinformatics/btu030. Epub 2014 Jan 21.
6
Exploring genome characteristics and sequence quality without a reference.
Bioinformatics. 2014 May 1;30(9):1228-35. doi: 10.1093/bioinformatics/btu023. Epub 2014 Jan 17.
7
Next-generation sequencing platforms.
Annu Rev Anal Chem (Palo Alto Calif). 2013;6:287-303. doi: 10.1146/annurev-anchem-062012-092628.
8
Estimation of sequencing error rates in short reads.
BMC Bioinformatics. 2012 Jul 30;13:185. doi: 10.1186/1471-2105-13-185.
10
Efficient counting of k-mers in DNA sequences using a bloom filter.
BMC Bioinformatics. 2011 Aug 10;12:333. doi: 10.1186/1471-2105-12-333.
