BMC Bioinformatics. 2014;15 Suppl 9(Suppl 9):S14. doi: 10.1186/1471-2105-15-S9-S14. Epub 2014 Sep 10.
Many programs are available for generating simulated whole-genome shotgun sequence reads. The data generated by many of these programs follow predefined models, which limits their use to the authors' original intentions. For example, many models assume that read lengths follow a uniform or normal distribution. Other programs generate models from actual sequencing data, but are limited to reads from single-genome studies. To our knowledge, no program allows a user to generate simulated data that follow non-parametric read-length distributions and quality profiles based on empirically derived information from metagenomic sequencing data.
We present BEAR (Better Emulation for Artificial Reads), a program that uses a machine-learning approach to generate reads with lengths and quality values that closely match empirically derived distributions. BEAR can emulate reads from various sequencing platforms, including Illumina, 454, and Ion Torrent. BEAR requires minimal user input, as it automatically determines appropriate parameter settings from user-supplied data. BEAR also uses a unique method for deriving run-specific error rates, and extracts useful statistics, such as quality-error models, from the metagenomic data itself. Many existing simulators are specific to a particular sequencing technology; BEAR is not restricted in this way. Because of its flexibility, BEAR is particularly useful for emulating the behaviour of technologies like Ion Torrent, for which no dedicated sequencing simulators are currently available. BEAR is also the first metagenomic sequencing simulator that automates the generation of abundances, which can otherwise be an arduous task.
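To illustrate what a non-parametric read-length distribution means in practice, the following minimal Python sketch tabulates the lengths observed in a set of reads and resamples from that table, rather than assuming a uniform or normal shape. It is not BEAR's implementation: the function names `fit_length_distribution` and `sample_read_lengths` and the toy `observed` lengths are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def fit_length_distribution(read_lengths):
    """Tabulate an empirical (non-parametric) read-length distribution.

    Returns the distinct lengths in ascending order and how often each occurred.
    """
    counts = Counter(read_lengths)
    lengths = sorted(counts)
    weights = [counts[n] for n in lengths]
    return lengths, weights


def sample_read_lengths(lengths, weights, n, seed=None):
    """Draw n read lengths with probability proportional to observed frequency."""
    rng = random.Random(seed)
    return rng.choices(lengths, weights=weights, k=n)


# Toy input (hypothetical): lengths that would normally come from a real FASTQ file.
observed = [98, 100, 100, 100, 151, 151, 250]
lengths, weights = fit_length_distribution(observed)
simulated = sample_read_lengths(lengths, weights, n=1000, seed=0)
```

Because the sampler only ever returns lengths that occurred in the input, multimodal or skewed length profiles are reproduced without choosing a parametric family in advance.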
BEAR is useful for evaluating data processing tools in genomics. It has many advantages over existing comparable software, such as generating more realistic reads and being independent of sequencing technology, and has features particularly useful for metagenomics work.