
A better sequence-read simulator program for metagenomics.

Publication information

BMC Bioinformatics. 2014;15 Suppl 9(Suppl 9):S14. doi: 10.1186/1471-2105-15-S9-S14. Epub 2014 Sep 10.

DOI:10.1186/1471-2105-15-S9-S14
PMID:25253095
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4168713/
Abstract

BACKGROUND

There are many programs available for generating simulated whole-genome shotgun sequence reads. The data generated by many of these programs follow predefined models, which limits their use to the authors' original intentions. For example, many models assume that read lengths follow a uniform or normal distribution. Other programs generate models from actual sequencing data, but are limited to reads from single-genome studies. To our knowledge, there are no programs that allow a user to generate simulated data following non-parametric read-length distributions and quality profiles based on empirically-derived information from metagenomics sequencing data.

RESULTS

We present BEAR (Better Emulation for Artificial Reads), a program that uses a machine-learning approach to generate reads with lengths and quality values that closely match empirically-derived distributions. BEAR can emulate reads from various sequencing platforms, including Illumina, 454, and Ion Torrent. BEAR requires minimal user input, as it automatically determines appropriate parameter settings from user-supplied data. BEAR also uses a unique method for deriving run-specific error rates, and extracts useful statistics from the metagenomic data itself, such as quality-error models. Many existing simulators are specific to a particular sequencing technology; however, BEAR is not restricted in this way. Because of its flexibility, BEAR is particularly useful for emulating the behaviour of technologies like Ion Torrent, for which no dedicated sequencing simulators are currently available. BEAR is also the first metagenomic sequencing simulator program that automates the process of generating abundances, which can be an arduous task.
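To illustrate what a non-parametric, empirically derived read-length distribution means, here is a minimal Python sketch. It is not BEAR's implementation: the abstract describes a machine-learning approach, whereas this sketch simply resamples the observed length frequencies. The function names and the example lengths are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def build_length_distribution(read_lengths):
    """Tabulate observed read lengths into (lengths, weights).

    Hypothetical helper: the weights are raw counts, so no uniform or
    normal shape is imposed on the distribution.
    """
    counts = Counter(read_lengths)
    lengths = sorted(counts)
    weights = [counts[length] for length in lengths]
    return lengths, weights


def sample_read_lengths(lengths, weights, n, seed=None):
    """Draw n simulated read lengths in proportion to observed counts."""
    rng = random.Random(seed)
    return rng.choices(lengths, weights=weights, k=n)


# Made-up lengths standing in for a user-supplied sequencing run.
observed = [98, 100, 100, 100, 151, 151, 250]
lengths, weights = build_length_distribution(observed)
simulated = sample_read_lengths(lengths, weights, n=1000, seed=0)
```

Because the sampler draws only from lengths seen in the input, the simulated reads reproduce multimodal or skewed length profiles that a fitted uniform or normal model would smooth away.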

CONCLUSIONS

BEAR is useful for evaluating data processing tools in genomics. It has many advantages over existing comparable software, such as generating more realistic reads and being independent of sequencing technology, and has features particularly useful for metagenomics work.


