Repeat-aware modeling and correction of short read errors.

Affiliation

Department of Electrical and Computer Engineering, Iowa State University, Ames, Iowa 50011, USA.

Publication information

BMC Bioinformatics. 2011 Feb 15;12 Suppl 1(Suppl 1):S52. doi: 10.1186/1471-2105-12-S1-S52.

DOI:10.1186/1471-2105-12-S1-S52

PMID:21342585

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3044310/

Abstract

BACKGROUND

High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications, including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of kmers in reads and validating those whose frequencies exceed a threshold. In genomes with high repeat content, an erroneous kmer may be observed frequently if it differs by only a few nucleotides from valid kmers that occur multiple times in the genome. Error detection and correction have mostly been applied to genomes with low repeat content, and the problem remains challenging for genomes with high repeat content.
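The conventional threshold-based detection described above can be illustrated with a minimal Python sketch. The function names, the toy reads, and the values `k=4` and `threshold=1` are assumptions chosen for illustration; they are not taken from the paper or its software.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (kmer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def validate_kmers(counts, threshold):
    """Treat kmers whose observed frequency exceeds `threshold` as valid."""
    return {kmer for kmer, c in counts.items() if c > threshold}


# Toy data (hypothetical): the last read carries one substitution (T -> A).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
counts = count_kmers(reads, k=4)
valid = validate_kmers(counts, threshold=1)
```

In this toy input the three kmers of the correct read are each seen three times and pass the threshold, while the three kmers created by the substitution are each seen once and are rejected. The weakness the abstract points to is that, in a repeat-rich genome, a misread of a highly repeated kmer can itself exceed any fixed threshold.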

RESULTS

We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer genomic frequencies of kmers from their observed frequencies by analyzing the misread relationships among observed kmers. We also propose a method to estimate the threshold useful for validating kmers whose estimated genomic frequency exceeds the threshold. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content.
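To make the idea of inferring source frequencies from misread relationships more concrete, here is a toy Python sketch. It is not the authors' estimator: it uses a generic expectation-maximization style reassignment of observed counts, assumes independent substitution errors with a uniform choice among the three other bases, and uses hypothetical per-position error rates within the kmer as a stand-in for position-dependent read errors. All names and numbers are illustrative assumptions.

```python
def misread_prob(src, dst, err):
    """P(src is read as dst), assuming independent substitutions with
    per-position rate err[i] and a uniform choice among the 3 other bases."""
    p = 1.0
    for a, b, e in zip(src, dst, err):
        p *= (1.0 - e) if a == b else e / 3.0
    return p


def estimate_source_counts(observed, err, iterations=50):
    """EM-style reassignment: each observed count is shared among candidate
    source kmers in proportion to (current estimate) x (misread probability)."""
    kmers = list(observed)
    est = {x: float(observed[x]) for x in kmers}
    for _ in range(iterations):
        new = dict.fromkeys(kmers, 0.0)
        for y in kmers:
            weights = {x: est[x] * misread_prob(x, y, err) for x in kmers}
            total = sum(weights.values())
            if total == 0.0:
                continue  # no plausible source under the assumed error model
            for x, w in weights.items():
                new[x] += observed[y] * w / total
        est = new
    return est


# Hypothetical observed counts: ACGA is one substitution away from the
# heavily repeated ACGT; error rates rise toward the 3' end of the kmer.
observed = {"ACGT": 1000, "ACGA": 5, "TTTT": 400}
err = [0.005, 0.005, 0.01, 0.03]
est = estimate_source_counts(observed, err)
```

Under these assumed rates, roughly ten misreads of ACGT as ACGA would be expected, so the five observed ACGA counts are explained entirely by ACGT and its estimate shrinks toward zero, while the total count is preserved. A real implementation would restrict candidate sources to kmers within a small Hamming distance rather than comparing all pairs.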

AVAILABILITY

The software is implemented in C++ and is freely available under the GNU GPL3 license and the Boost Software V1.0 license at "http://aluru-sun.ece.iastate.edu/doku.php?id=redeem".

CONCLUSIONS

We introduce a statistical framework to model sequencing errors in next-generation reads, which yields promising results in detecting and correcting errors for genomes with high repeat content.
