Discovering motifs that induce sequencing errors.

Affiliation

Life Sciences Group, Centrum Wiskunde & Informatica, Amsterdam, Netherlands.

Publication information

BMC Bioinformatics. 2013;14 Suppl 5(Suppl 5):S1. doi: 10.1186/1471-2105-14-S5-S1. Epub 2013 Apr 10.

DOI:10.1186/1471-2105-14-S5-S1

PMID:23735080

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3622629/

Abstract

BACKGROUND

Elevated sequencing error rates are the predominant obstacle in single-nucleotide polymorphism (SNP) detection, which is a major goal in the bulk of current studies using next-generation sequencing (NGS). Beyond routinely handled generic sources of errors, certain base calling errors relate to specific sequence patterns. Statistically principled ways to associate sequence patterns with base calling errors have not been previously described. Extant approaches either incur decisive losses in power, because they relate errors to individual genomic positions rather than motifs, or do not properly distinguish between motif-induced and sequence-unspecific sources of errors.

RESULTS

Here, for the first time, we describe a statistically rigorous framework for the discovery of motifs that induce sequencing errors. We apply our method to several datasets from Illumina GA IIx, HiSeq 2000, and MiSeq sequencers. We confirm previously known error-causing sequence contexts and report new, more specific ones.
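The abstract does not spell out the statistical framework, so the following is only a toy illustration of the general idea of pooling evidence over all occurrences of a motif rather than testing individual genomic positions. It is not the authors' method. The function names, the per-position count inputs, and the choice of a one-sided Fisher exact test are all assumptions made for this sketch.

```python
from math import comb


def motif_end_positions(reference: str, motif: str) -> set[int]:
    """0-based indices of the base immediately following each motif occurrence."""
    k = len(motif)
    return {i + k for i in range(len(reference) - k) if reference[i:i + k] == motif}


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """P(X >= a) under the hypergeometric null for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1 = a + b          # base calls made right after the motif
    col1 = a + c          # erroneous base calls overall
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(n - row1, col1 - x) for x in range(a, upper + 1))
    return tail / comb(n, col1)


def motif_error_pvalue(reference, motif, error_calls, total_calls):
    """Pool calls over all motif occurrences and test for an excess of errors.

    error_calls[i] / total_calls[i]: mismatching / all base calls at position i
    (hypothetical inputs, e.g. derived from a pileup).
    """
    after = motif_end_positions(reference, motif)
    a = sum(error_calls[i] for i in after)
    b = sum(total_calls[i] - error_calls[i] for i in after)
    c = sum(e for i, e in enumerate(error_calls) if i not in after)
    d = sum(t - e for i, (e, t) in enumerate(zip(error_calls, total_calls))
            if i not in after)
    return fisher_one_sided(a, b, c, d)


# Toy data (invented for illustration): errors pile up right after "GG".
reference = "ACGGTACGGTAC"
total_calls = [10] * len(reference)
error_calls = [0] * len(reference)
error_calls[4], error_calls[9] = 5, 6
p_value = motif_error_pvalue(reference, "GG", error_calls, total_calls)
print(f"one-sided p-value for GG: {p_value:.3g}")
```

Pooling the two occurrences gives the test 20 base calls to work with instead of 10 per position, which is the power argument made in the Background. A real analysis would also need to separate motif-induced errors from sequence-unspecific ones and to correct for testing many motifs, neither of which this sketch attempts.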

CONCLUSIONS

Checking for error-inducing motifs should be included in SNP calling pipelines to avoid false positives. To facilitate filtering of sets of putative SNPs, we provide tracks of error-prone genomic positions (in BED format).
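As a rough sketch of how such a BED track might be used to filter putative SNPs, the snippet below drops every SNP whose position falls inside a track interval. It assumes standard BED conventions (0-based, half-open intervals) and 1-based SNP coordinates as in VCF. The function names and the example data are hypothetical, and the layout of the authors' actual track files is not described in the abstract.

```python
from bisect import bisect_right
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: sorted, merged list of (start, end)}."""
    raw = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments, and BED header lines
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))
    track = {}
    for chrom, ivs in raw.items():
        merged = []
        for s, e in sorted(ivs):
            if merged and s <= merged[-1][1]:
                # overlapping or adjacent: extend the previous interval
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((s, e))
        track[chrom] = merged
    return track


def in_track(track, chrom, pos1):
    """True if the 1-based position pos1 lies inside a track interval."""
    ivs = track.get(chrom, [])
    pos0 = pos1 - 1  # convert to 0-based to match BED coordinates
    idx = bisect_right(ivs, (pos0, float("inf"))) - 1
    return idx >= 0 and ivs[idx][0] <= pos0 < ivs[idx][1]


def filter_snps(snps, track):
    """Keep only SNPs (chrom, 1-based pos) outside error-prone intervals."""
    return [(c, p) for c, p in snps if not in_track(track, c, p)]


# Invented example data, not taken from the published tracks.
bed_lines = ["track name=error_prone", "chr1\t10\t12", "chr1\t11\t15", "chr2\t0\t1"]
snps = [("chr1", 10), ("chr1", 11), ("chr1", 15), ("chr1", 16), ("chr2", 1), ("chr3", 5)]
kept = filter_snps(snps, load_bed(bed_lines))
print(kept)
```

Merging the intervals first keeps them disjoint, so a single binary search per SNP is enough. Whether a flagged SNP should be discarded outright or only down-weighted is a pipeline decision that this sketch leaves open.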

AVAILABILITY

http://discovering-cse.googlecode.com.

Similar articles

Discovering motifs that induce sequencing errors.

BMC Bioinformatics. 2013;14 Suppl 5(Suppl 5):S1. doi: 10.1186/1471-2105-14-S5-S1. Epub 2013 Apr 10.

Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data.

BMC Bioinformatics. 2016 Mar 11;17:125. doi: 10.1186/s12859-016-0976-y.

Analysis of context-dependent errors for Illumina sequencing.

J Bioinform Comput Biol. 2012 Apr;10(2):1241005. doi: 10.1142/S0219720012410053.

Identification and correction of systematic error in high-throughput sequence data.

BMC Bioinformatics. 2011 Nov 21;12:451. doi: 10.1186/1471-2105-12-451.

Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems.

Genome Biol. 2011 Nov 8;12(11):R112. doi: 10.1186/gb-2011-12-11-r112.

An investigation of causes of false positive single nucleotide polymorphisms using simulated reads from a small eukaryote genome.

BMC Bioinformatics. 2015 Nov 11;16:382. doi: 10.1186/s12859-015-0801-z.

Whole-Genome Sequence Accuracy Is Improved by Replication in a Population of Mutagenized Sorghum.

G3 (Bethesda). 2018 Mar 2;8(3):1079-1094. doi: 10.1534/g3.117.300301.

A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers.

BMC Genomics. 2012 Jul 24;13:341. doi: 10.1186/1471-2164-13-341.

SNP calling by sequencing pooled samples.

BMC Bioinformatics. 2012 Sep 20;13:239. doi: 10.1186/1471-2105-13-239.

SequencErr: measuring and suppressing sequencer errors in next-generation sequencing data.

Genome Biol. 2021 Jan 25;22(1):37. doi: 10.1186/s13059-020-02254-2.

Cited by

Swiftly identifying strongly unique k-mers.

Algorithms Mol Biol. 2025 Jul 13;20(1):13. doi: 10.1186/s13015-025-00286-6.

De-biasing microbiome sequencing data: bacterial morphology-based correction of extraction bias and correlates of chimera formation.

Microbiome. 2025 Feb 4;13(1):38. doi: 10.1186/s40168-024-01998-4.

Exploring the impact of sequence context on errors in SNP genotype calling with whole genome sequencing data using AI-based autoencoder approach.

NAR Genom Bioinform. 2024 Sep 24;6(3):lqae131. doi: 10.1093/nargab/lqae131. eCollection 2024 Sep.

Chasing Sequencing Perfection: Marching Toward Higher Accuracy and Lower Costs.

Genomics Proteomics Bioinformatics. 2024 Jul 3;22(2). doi: 10.1093/gpbjnl/qzae024.

Allele detection using k-mer-based sequencing error profiles.

Bioinform Adv. 2023 Oct 20;3(1):vbad149. doi: 10.1093/bioadv/vbad149. eCollection 2023.

Tatajuba: exploring the distribution of homopolymer tracts.

NAR Genom Bioinform. 2022 Feb 2;4(1):lqac003. doi: 10.1093/nargab/lqac003. eCollection 2022 Mar.

Sequencing error profiles of Illumina sequencing instruments.

NAR Genom Bioinform. 2021 Mar 27;3(1):lqab019. doi: 10.1093/nargab/lqab019. eCollection 2021 Mar.

Needlestack: an ultra-sensitive variant caller for multi-sample next generation sequencing data.

NAR Genom Bioinform. 2020 Jun;2(2):lqaa021. doi: 10.1093/nargab/lqaa021. Epub 2020 Apr 20.

Varlociraptor: enhancing sensitivity and controlling false discovery rate in somatic indel discovery.

Genome Biol. 2020 Apr 28;21(1):98. doi: 10.1186/s13059-020-01993-6.

Applying next-generation sequencing to unravel the mutational landscape in viral quasispecies.

Virus Res. 2020 Jul 2;283:197963. doi: 10.1016/j.virusres.2020.197963. Epub 2020 Apr 9.
