Centre for Marine Bio-Innovation and School of Biotechnology and Biomolecular Sciences, UNSW, Sydney, NSW, Australia.
BMC Genomics. 2012 Feb 15;13:74. doi: 10.1186/1471-2164-13-74.
GemSIM, or General Error-Model based SIMulator, is a next-generation sequencing simulator capable of generating single or paired-end reads for any sequencing technology compatible with the generic formats SAM and FASTQ (including Illumina and Roche/454). GemSIM creates and uses empirically derived, sequence-context based error models to realistically emulate individual sequencing runs and/or technologies. Empirical fragment length and quality score distributions are also used. Reads may be drawn from one or more genomes or haplotype sets, facilitating simulation of deep sequencing, metagenomic, and resequencing projects.
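The idea of an empirically derived, sequence-context-based error model can be illustrated with a toy sketch. This is not GemSIM's implementation: the function names `build_error_model` and `simulate_read`, the context length `k`, and the restriction to substitution errors are all assumptions made for illustration. It tabulates, for each preceding k-base context, how often the next base was miscalled in aligned reads, then reuses those rates to inject errors into new reads.

```python
import random

BASES = "ACGT"

def build_error_model(pairs, k=2):
    """Estimate a substitution rate for each k-base preceding context.

    pairs: iterable of (reference, read) strings of equal length,
    already aligned without indels (a simplifying assumption).
    """
    counts = {}
    for ref, read in pairs:
        for i in range(k, len(ref)):
            ctx = ref[i - k:i]
            total, errs = counts.get(ctx, (0, 0))
            counts[ctx] = (total + 1, errs + (ref[i] != read[i]))
    return {ctx: errs / total for ctx, (total, errs) in counts.items()}

def simulate_read(ref, model, rng, k=2, default_rate=0.0):
    """Copy ref, substituting bases at the context-specific rate."""
    out = list(ref)
    for i in range(k, len(ref)):
        rate = model.get(ref[i - k:i], default_rate)
        if rng.random() < rate:
            # Replace with one of the three other bases, chosen uniformly.
            out[i] = rng.choice([b for b in BASES if b != ref[i]])
    return "".join(out)

# Hypothetical usage: learn rates from one aligned read, then simulate.
model = build_error_model([("ACGTACGT", "ACGTACGA")], k=2)
print(simulate_read("ACGTACGTACGT", model, random.Random(0)))
```

A fuller model would also condition on position within the read and would cover insertions, deletions, and quality scores, all of which the abstract describes as empirically derived.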
We demonstrate GemSIM's value by deriving error models from two different Illumina sequencing runs and one Roche/454 run, and comparing the resulting error profiles of each run. Overall error rates varied dramatically between individual Illumina runs, between the first and second reads in each pair, and between datasets from Illumina and Roche/454 technologies. Indels were markedly more frequent in Roche/454 than in Illumina data, and both technologies suffered from an increase in error rates near the end of each read. The effects of these different profiles on low-frequency SNP-calling accuracy were investigated by analysing simulated sequencing data for a mixture of bacterial haplotypes. In general, SNP-calling using VarScan was only accurate for SNPs with frequency > 3%, independent of which error model was used to simulate the data. Variation between error profiles interacted strongly with VarScan's 'minimum average quality' parameter, resulting in different optimal settings for different sequencing runs.
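One way to see why low-frequency SNPs are hard to separate from sequencing error is to ask how often errors alone would produce a given number of variant-supporting reads at one site. The sketch below assumes independent errors with a fixed per-read rate, which is a simplification and not how VarScan or GemSIM works; the function name `prob_at_least` and its parameters are hypothetical.

```python
from math import comb

def prob_at_least(n, p, k):
    """Return P(X >= k) for X ~ Binomial(n, p).

    n: read depth at the site (assumed fixed)
    p: per-read probability of a miscall to the variant base (assumed constant)
    k: minimum number of variant-supporting reads needed to call a SNP
    """
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical usage: chance that errors alone reach a 3% threshold at depth 200.
depth = 200
threshold_reads = int(0.03 * depth)
print(prob_at_least(depth, 0.005, threshold_reads))
```

Because real error rates depend on sequence context, read position, and the individual run, as described above, a fixed `p` can only give a rough intuition; simulation with run-specific error models is the more faithful approach.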
Next-generation sequencing has unprecedented potential for assessing genetic diversity; however, analysis is complicated because error profiles can vary noticeably even between different runs of the same technology. Simulation with GemSIM can help overcome this problem by providing insights into the error profiles of individual sequencing runs and allowing researchers to assess the effects of these errors on downstream data analysis.