Efficient error correction for next-generation sequencing of viral amplicons.

Affiliation

Laboratory of Molecular Epidemiology and Bioinformatics, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road NE, Atlanta, GA 30333, USA.

Publication information

BMC Bioinformatics. 2012 Jun 25;13 Suppl 10(Suppl 10):S6. doi: 10.1186/1471-2105-13-S10-S6.

DOI:10.1186/1471-2105-13-S10-S6

PMID:22759430

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3382444/

Abstract

BACKGROUND

Next-generation sequencing allows the analysis of an unprecedented number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. However, sequencing in bulk is error-prone, so the generated data require error identification and correction. Most error-correction methods to date are not optimized for amplicon analysis and assume that errors are randomly distributed. A recent quality assessment of amplicon sequences obtained by 454-sequencing showed that the error rate is strongly linked to the presence and size of homopolymers, the position in the sequence and the length of the amplicon. All of these parameters are strongly sequence-specific and should be incorporated into the calibration of error-correction algorithms designed for amplicon sequencing.
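Because the error rate is tied to homopolymers and their size, a calibration step would first need to locate homopolymer runs in each amplicon. The following is a minimal sketch of that bookkeeping only. The function name `homopolymer_runs` and the `min_len` cutoff are illustrative assumptions and are not taken from the paper.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=3):
    """Return (start, base, length) for every run of identical bases
    that is at least min_len long. min_len is a hypothetical cutoff."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


if __name__ == "__main__":
    # Lists the runs of G and T in a toy sequence.
    print(homopolymer_runs("ACGGGGTTTA"))
```

Positions are 0-based, so each reported run can be compared directly with the position-dependent and length-dependent effects mentioned above.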

RESULTS

In this paper, we present two new efficient error-correction algorithms optimized for viral amplicons: (i) k-mer-based error correction (KEC) and (ii) empirical frequency threshold (ET). Both were compared with a previously published clustering algorithm (SHORAH) to evaluate their relative performance on 24 experimental datasets obtained by 454-sequencing of amplicons with known sequences. All three algorithms show similar accuracy in finding true haplotypes. However, KEC and ET were significantly more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones.
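The abstract does not spell out how KEC or ET work internally, so the sketch below is not the authors' implementation. It only illustrates the two generic ideas the names suggest: counting k-mers across reads to flag rarely seen ones, and keeping haplotypes whose observed frequency meets a cutoff. All function names and the parameters `k`, `min_count` and `min_fraction` are assumptions chosen for illustration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer over all reads (a simple k-mer spectrum)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmer_starts(read, counts, k, min_count):
    """Start positions in `read` of k-mers seen fewer than min_count times.
    Clusters of such positions hint at a likely sequencing error."""
    return [
        i
        for i in range(len(read) - k + 1)
        if counts[read[i:i + k]] < min_count
    ]


def frequent_haplotypes(reads, min_fraction):
    """Keep distinct reads whose relative frequency is at least
    min_fraction; returns {haplotype: frequency}."""
    total = len(reads)
    tally = Counter(reads)
    return {
        hap: n / total
        for hap, n in tally.items()
        if n / total >= min_fraction
    }


if __name__ == "__main__":
    # Toy data: nine identical reads and one read with a single substitution.
    reads = ["ACGTAC"] * 9 + ["ACGAAC"]
    counts = kmer_counts(reads, k=3)
    print(weak_kmer_starts("ACGAAC", counts, k=3, min_count=2))
    print(frequent_haplotypes(reads, min_fraction=0.2))
```

In this toy input the single divergent read is flagged by its rare k-mers and also falls below the frequency cutoff. Real thresholds would have to be calibrated on amplicons with known sequences, as the evaluation described above does.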

CONCLUSIONS

Both algorithms, KEC and ET, are highly suitable for rapid recovery of error-free haplotypes obtained by 454-sequencing of amplicons from heterogeneous viruses. The implementations of the algorithms and the data sets used for their testing are available at: http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm.



