GemSIM: general, error-model based simulator of next-generation sequencing data.

Affiliation

Centre for Marine Bio-Innovation and School of Biotechnology and Biomolecular Sciences, UNSW, Sydney, NSW, Australia.

Publication information

BMC Genomics. 2012 Feb 15;13:74. doi: 10.1186/1471-2164-13-74.

DOI:10.1186/1471-2164-13-74

PMID:22336055

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3305602/

Abstract

BACKGROUND

GemSIM, or General Error-Model based SIMulator, is a next-generation sequencing simulator capable of generating single or paired-end reads for any sequencing technology compatible with the generic formats SAM and FASTQ (including Illumina and Roche/454). GemSIM creates and uses empirically derived, sequence-context based error models to realistically emulate individual sequencing runs and/or technologies. Empirical fragment length and quality score distributions are also used. Reads may be drawn from one or more genomes or haplotype sets, facilitating simulation of deep sequencing, metagenomic, and resequencing projects.
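The idea of an empirically derived, sequence-context based error model can be illustrated with a minimal Python sketch. It is not GemSIM's implementation: the function names `build_error_model` and `simulate_read`, the choice of context (read position plus the `k` preceding reference bases), and the restriction to substitutions are all assumptions made for illustration. Indels, quality scores, and fragment lengths are left out.

```python
import random
from collections import defaultdict


def build_error_model(pairs, k=1):
    """Count observed bases per (position, preceding k reference bases, true base).

    `pairs` is an iterable of (reference, read) strings of equal length,
    standing in for gap-free alignments (an assumption of this sketch).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for ref, read in pairs:
        for i, (true_base, seen_base) in enumerate(zip(ref, read)):
            context = ref[max(0, i - k):i]
            counts[(i, context, true_base)][seen_base] += 1
    return counts


def simulate_read(ref, model, rng, k=1):
    """Emit a read from `ref`, sampling each base from the empirical counts."""
    out = []
    for i, true_base in enumerate(ref):
        context = ref[max(0, i - k):i]
        observed = model.get((i, context, true_base))
        if not observed:
            # Context never seen in the training data: assume no error.
            out.append(true_base)
            continue
        bases = list(observed)
        weights = [observed[b] for b in bases]
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)


# Toy training data: one perfect read and one with a T->A error at the last base.
training = [("ACGT", "ACGT"), ("ACGT", "ACGA")]
model = build_error_model(training)
read = simulate_read("ACGT", model, random.Random(0))
```

With this toy training set, the first three bases of a simulated read are always reproduced correctly, while the last base is drawn as `T` or `A` with equal weight, because that is the only context in which an error was observed.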

RESULTS

We demonstrate GemSIM's value by deriving error models from two different Illumina sequencing runs and one Roche/454 run, and comparing the resulting error profiles of each run. Overall error rates varied dramatically between individual Illumina runs, between the first and second reads in each pair, and between datasets from Illumina and Roche/454 technologies. Indels were markedly more frequent in Roche/454 than in Illumina data, and both technologies suffered from an increase in error rates near the end of each read. The effects of these different profiles on low-frequency SNP-calling accuracy were investigated by analysing simulated sequencing data for a mixture of bacterial haplotypes. In general, SNP-calling using VarScan was only accurate for SNPs with frequency > 3%, independent of which error model was used to simulate the data. Variation between error profiles interacted strongly with VarScan's 'minimum average quality' parameter, resulting in different optimal settings for different sequencing runs.

CONCLUSIONS

Next-generation sequencing has unprecedented potential for assessing genetic diversity; however, analysis is complicated because error profiles can vary noticeably even between different runs of the same technology. Simulation with GemSIM can help overcome this problem by providing insights into the error profiles of individual sequencing runs and allowing researchers to assess the effects of these errors on downstream data analysis.
