TreeToReads - a pipeline for simulating raw reads from phylogenies.

Author information

McTavish Emily Jane, Pettengill James, Davis Steven, Rand Hugh, Strain Errol, Allard Marc, Timme Ruth E

Affiliations

University of California, Merced, Merced, CA, USA.

University of Kansas, Lawrence, KS, USA.

Publication information

BMC Bioinformatics. 2017 Mar 20;18(1):178. doi: 10.1186/s12859-017-1592-1.

DOI:10.1186/s12859-017-1592-1

PMID:28320310

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5359950/

Abstract

BACKGROUND

Using phylogenomic analysis tools for tracking pathogens has become standard practice in academia, public health agencies, and large industries. Using the same raw read genomic data as input, there are several different approaches being used to infer phylogenetic trees. These include many different SNP pipelines, wgMLST approaches, k-mer algorithms, whole genome alignment and others; each of these has advantages and disadvantages: some have been extensively validated, some are faster, some have higher resolution. A few of these analysis approaches are well-integrated into the regulatory process of US Federal agencies (e.g. the FDA's SNP pipeline for tracking foodborne pathogens). However, despite extensive validation on benchmark datasets and comparison with other pipelines, we lack methods for fully exploring the effects of multiple parameter values in each pipeline that can potentially have an effect on whether the correct phylogenetic tree is recovered.

RESULTS

To resolve this problem, we offer a program, TreeToReads, which can generate raw read data from mutated genomes simulated under a known phylogeny. This simulation pipeline allows direct comparisons of simulated and observed data in a controlled environment. At each step of these simulations, researchers can vary parameters of interest (e.g., input tree topology, amount of sequence divergence, rate of indels, read coverage, distance of reference genome, etc.) to assess the effects of various parameter values on correctly calling SNPs and reconstructing an accurate tree.
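The general idea of simulating reads under a known phylogeny can be illustrated with a toy sketch. The code below is not TreeToReads' own code or interface; the tree, the function names (`mutate`, `evolve`, `simulate_reads`), and all parameter values are hypothetical, and it makes strong simplifying assumptions: substitutions only (no indels), error-free reads, and uniform read placement. Branch lengths stand in for the amount of sequence divergence, and `coverage` for read coverage.

```python
import random

# Hypothetical toy tree: parent -> [(child, branch length in substitutions/site)].
TREE = {
    "root": [("anc", 0.02), ("C", 0.05)],
    "anc": [("A", 0.01), ("B", 0.01)],
}

def mutate(seq, branch_length, rng):
    """Return a copy of seq with round(branch_length * len(seq)) substitutions."""
    bases = list(seq)
    n_mut = round(branch_length * len(bases))
    # Pick distinct positions and replace each base with a different one.
    for pos in rng.sample(range(len(bases)), n_mut):
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)

def evolve(tree, node, seq, rng, tips):
    """Walk the tree from node, mutating along each branch; collect tip genomes."""
    children = tree.get(node, [])
    if not children:
        tips[node] = seq
    for child, length in children:
        evolve(tree, child, mutate(seq, length, rng), rng, tips)
    return tips

def simulate_reads(genome, coverage, read_len, rng):
    """Sample error-free, fixed-length reads uniformly to reach the target coverage."""
    n_reads = coverage * len(genome) // read_len
    starts = (rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]

rng = random.Random(42)  # fixed seed so the simulation is repeatable
root_genome = "".join(rng.choice("ACGT") for _ in range(1000))
tip_genomes = evolve(TREE, "root", root_genome, rng, {})
reads = {
    tip: simulate_reads(genome, coverage=10, read_len=50, rng=rng)
    for tip, genome in tip_genomes.items()
}
```

Because the true tree and the true mutations are known, the reads for tips A, B, and C could be fed to a SNP pipeline and the recovered tree compared against `TREE`; changing the branch lengths or `coverage` then shows how each parameter affects the result.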

CONCLUSIONS

Such critical assessments of the accuracy and robustness of analytical pipelines are essential to progress in both research and applied settings.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/40dc/5359950/441b4a06a278/12859_2017_1592_Fig1_HTML.jpg
