SimuSCoP: reliably simulate Illumina sequencing data based on position and context dependent profiles.

Affiliations

School of Information Engineering, Ningxia University, Yinchuan, 750021, China.

Hefei National Laboratory for Physical Sciences at Microscale, USTC-SJH Joint Center for Human Reproduction and Genetics, School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China.

Publication information

BMC Bioinformatics. 2020 Jul 23;21(1):331. doi: 10.1186/s12859-020-03665-5.

DOI: 10.1186/s12859-020-03665-5

PMID: 32703148

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7379788/

Abstract

BACKGROUND

A number of simulators have been developed for emulating next-generation sequencing data by incorporating known errors such as base substitutions and indels. However, their practicality may be degraded by functional and runtime limitations. In particular, positional and genomic contextual information is not effectively used to characterize base substitution patterns, and the positional and contextual differences in Phred quality scores have not been fully investigated. Thus, a more effective and efficient bioinformatics tool is needed.

RESULTS

Here, we introduce a novel tool, SimuSCoP, to reliably emulate complex DNA sequencing data. The base substitution patterns and the statistical behavior of quality scores in Illumina sequencing data are explored and integrated into the simulation model so that datasets can be emulated reliably for different applications. In addition, SimuSCoP provides an integrated and easy-to-use pipeline for end-to-end simulation of complex samples, and it achieves high runtime efficiency through a multithreaded implementation with low memory consumption. These features give SimuSCoP substantial improvements in reliability, functionality, practicality and runtime efficiency. The tool is comprehensively evaluated in multiple aspects, including consistency of profiles and simulation of genomic variations and complex tumor samples, and the results demonstrate the advantages of SimuSCoP over existing tools.
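To make the idea of a position and context dependent profile concrete, the following is a minimal sketch and not SimuSCoP's actual implementation. It assumes that a profile is a table of observed read-base counts keyed by read position and a reference 3-mer centred on the sequenced base. The function names, the 3-mer context width and the toy observations are all hypothetical. The Phred conversion uses the standard definition Q = -10 * log10(p).

```python
import random
from collections import defaultdict

BASES = "ACGT"

def phred_to_error_prob(q):
    """Standard Phred definition: Q = -10 * log10(p), so p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def learn_profile(observations):
    """Count read bases per (read position, reference context).

    observations: iterable of (position, context, read_base) tuples, where
    context is a reference 3-mer centred on the sequenced base. This layout
    is an assumption made for illustration only.
    """
    counts = defaultdict(lambda: {b: 0 for b in BASES})
    for pos, ctx, base in observations:
        counts[(pos, ctx)][base] += 1
    return counts

def sample_base(profile, pos, ctx, rng):
    """Draw a read base from the learned distribution for (pos, ctx).

    Falls back to the reference base (the middle of the context) when the
    position/context combination was never observed.
    """
    row = profile.get((pos, ctx))
    if row is None or sum(row.values()) == 0:
        return ctx[1]
    return rng.choices(BASES, weights=[row[b] for b in BASES], k=1)[0]

# Toy data: at read position 0 in context "ACG", the reference base C is
# read correctly 99 times and miscalled as T once.
obs = [(0, "ACG", "C")] * 99 + [(0, "ACG", "T")]
profile = learn_profile(obs)
rng = random.Random(0)
reads = [sample_base(profile, 0, "ACG", rng) for _ in range(1000)]
print(reads.count("C"), reads.count("T"))
print(phred_to_error_prob(30))  # error probability implied by Q30
```

Keying the table on both position and context lets the sampled substitution rate differ along the read and across sequence neighbourhoods, which is the property the abstract emphasises. A realistic simulator would also need to model quality scores and indels, which this sketch omits.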

CONCLUSIONS

SimuSCoP, a new bioinformatics tool, is developed to learn informative profiles from real sequencing data and to reliably mimic complex data by introducing various genomic variations. We believe that the presented work will catalyse the development of new downstream bioinformatics methods for analyzing sequencing data.
