A better sequence-read simulator program for metagenomics.

Publication information

BMC Bioinformatics. 2014;15 Suppl 9(Suppl 9):S14. doi: 10.1186/1471-2105-15-S9-S14. Epub 2014 Sep 10.

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4168713/

Abstract

BACKGROUND

There are many programs available for generating simulated whole-genome shotgun sequence reads. The data generated by many of these programs follow predefined models, which limits their use to the authors' original intentions. For example, many models assume that read lengths follow a uniform or normal distribution. Other programs generate models from actual sequencing data, but are limited to reads from single-genome studies. To our knowledge, there are no programs that allow a user to generate simulated data following non-parametric read-length distributions and quality profiles based on empirically-derived information from metagenomics sequencing data.

RESULTS

We present BEAR (Better Emulation for Artificial Reads), a program that uses a machine-learning approach to generate reads with lengths and quality values that closely match empirically-derived distributions. BEAR can emulate reads from various sequencing platforms, including Illumina, 454, and Ion Torrent. BEAR requires minimal user input, as it automatically determines appropriate parameter settings from user-supplied data. BEAR also uses a unique method for deriving run-specific error rates, and extracts useful statistics from the metagenomic data itself, such as quality-error models. Many existing simulators are specific to a particular sequencing technology; however, BEAR is not restricted in this way. Because of its flexibility, BEAR is particularly useful for emulating the behaviour of technologies like Ion Torrent, for which no dedicated sequencing simulators are currently available. BEAR is also the first metagenomic sequencing simulator program that automates the process of generating abundances, which can be an arduous task.
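The abstract does not spell out how BEAR fits its models, so the following is only a minimal sketch of the general idea of a non-parametric read-length distribution: read lengths are drawn in proportion to how often each length was observed in real data, rather than from a uniform or normal model. It is a simple resampling stand-in, not BEAR's machine-learning method, and every function name here (`fit_length_distribution`, `sample_read_lengths`, `simulate_read`) is hypothetical.

```python
import random
from collections import Counter


def fit_length_distribution(observed_lengths):
    """Summarize observed read lengths as (lengths, weights).

    Assumption: the empirical distribution is simply the frequency of each
    observed length; no parametric form (uniform, normal) is imposed.
    """
    counts = Counter(observed_lengths)
    lengths = sorted(counts)
    weights = [counts[n] for n in lengths]
    return lengths, weights


def sample_read_lengths(lengths, weights, n_reads, seed=None):
    """Draw n_reads lengths in proportion to their observed frequencies."""
    rng = random.Random(seed)
    return rng.choices(lengths, weights=weights, k=n_reads)


def simulate_read(genome, read_length, rng):
    """Cut an error-free read of the requested length from a random position.

    Reads longer than the genome are truncated to the genome length.
    """
    read_length = min(read_length, len(genome))
    start = rng.randrange(len(genome) - read_length + 1)
    return genome[start:start + read_length]


if __name__ == "__main__":
    # Hypothetical lengths taken from a real sequencing run.
    observed = [98, 100, 100, 100, 143, 250, 250, 251]
    lengths, weights = fit_length_distribution(observed)

    rng = random.Random(42)
    toy_genome = "".join(rng.choice("ACGT") for _ in range(1000))
    for n in sample_read_lengths(lengths, weights, n_reads=3, seed=42):
        print(n, simulate_read(toy_genome, n, rng)[:20] + "...")
```

Because the sampler reuses the observed frequencies directly, skewed or multi-modal length profiles (common for 454 and Ion Torrent data) are reproduced without choosing a distribution family. Quality values and sequencing errors are deliberately left out of this sketch.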

CONCLUSIONS

BEAR is useful for evaluating data processing tools in genomics. It has many advantages over existing comparable software, such as generating more realistic reads and being independent of sequencing technology, and has features particularly useful for metagenomics work.

Figure (image from the original article): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9a47/4168713/76b9d062fdd4/1471-2105-15-S9-S14-1.jpg

Similar articles

GemSIM: general, error-model based simulator of next-generation sequencing data.

BMC Genomics. 2012 Feb 15;13:74. doi: 10.1186/1471-2164-13-74.

Short clones or long clones? A simulation study on the use of paired reads in metagenomics.

BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S12. doi: 10.1186/1471-2105-11-S1-S12.

Improving the sensitivity of long read overlap detection using grouped short k-mer matches.

BMC Genomics. 2019 Apr 4;20(Suppl 2):190. doi: 10.1186/s12864-019-5475-x.

Unlocking short read sequencing for metagenomics.

PLoS One. 2010 Jul 28;5(7):e11840. doi: 10.1371/journal.pone.0011840.

Assessment of metagenomic assembly using simulated next generation sequencing data.

PLoS One. 2012;7(2):e31386. doi: 10.1371/journal.pone.0031386. Epub 2012 Feb 23.

SInC: an accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data.

BMC Bioinformatics. 2014 Feb 5;15:40. doi: 10.1186/1471-2105-15-40.

MinION™ nanopore sequencing of environmental metagenomes: a synthetic approach.

Gigascience. 2017 Mar 1;6(3):1-10. doi: 10.1093/gigascience/gix007.

Sketching and sampling approaches for fast and accurate long read classification.

BMC Bioinformatics. 2022 Oct 31;23(1):452. doi: 10.1186/s12859-022-05014-0.

LRTK: a platform agnostic toolkit for linked-read analysis of both human genome and metagenome.

Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae028.

Cited by

Semisynthetic simulation for microbiome data analysis.

Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbaf051.

Simulated High Throughput Sequencing Datasets: A Crucial Tool for Validating Bioinformatic Pathogen Detection Pipelines.

Biology (Basel). 2024 Sep 6;13(9):700. doi: 10.3390/biology13090700.

SWAMPy: simulating SARS-CoV-2 wastewater amplicon metagenomes.

Bioinformatics. 2024 Sep 2;40(9). doi: 10.1093/bioinformatics/btae532.

Boquila: NGS read simulator to eliminate read nucleotide bias in sequence analysis.

Turk J Biol. 2023 Feb 21;47(2):158-163. doi: 10.55730/1300-0152.2650. eCollection 2023.

SimFFPE and FilterFFPE: improving structural variant calling in FFPE samples.

Gigascience. 2021 Sep 22;10(9). doi: 10.1093/gigascience/giab065.

Tamock: simulation of habitat-specific benchmark data in metagenomics.

BMC Bioinformatics. 2021 May 1;22(1):227. doi: 10.1186/s12859-021-04154-z.

ReSeq simulates realistic Illumina high-throughput sequencing data.

Genome Biol. 2021 Feb 19;22(1):67. doi: 10.1186/s13059-021-02265-7.

Biases in genome reconstruction from metagenomic data.

PeerJ. 2020 Oct 30;8:e10119. doi: 10.7717/peerj.10119. eCollection 2020.

Evolution of Multi-Resistance to Vancomycin, Daptomycin, and Linezolid in Methicillin-Resistant Staphylococcus aureus Causing Persistent Bacteremia.

Front Microbiol. 2020 Jul 7;11:1414. doi: 10.3389/fmicb.2020.01414. eCollection 2020.

SimuSCoP: reliably simulate Illumina sequencing data based on position and context dependent profiles.

BMC Bioinformatics. 2020 Jul 23;21(1):331. doi: 10.1186/s12859-020-03665-5.

