Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance.

Author information

Timme Ruth E, Rand Hugh, Shumway Martin, Trees Eija K, Simmons Mustafa, Agarwala Richa, Davis Steven, Tillman Glenn E, Defibaugh-Chavez Stephanie, Carleton Heather A, Klimke William A, Katz Lee S

Affiliations

Center for Food Safety and Applied Nutrition, US Food and Drug Administration, College Park, MD, United States of America.

National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, United States of America.

Publication information

PeerJ. 2017 Oct 6;5:e3893. doi: 10.7717/peerj.3893. eCollection 2017.

Abstract

BACKGROUND

As next-generation sequencing technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationships in public health. Most new programs skip the traditional steps of ortholog determination and multi-gene alignment. Instead, they identify variants across a set of genomes, then summarize the results in a matrix of single-nucleotide polymorphisms (SNPs) or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines.
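As a rough illustration of the variant-matrix step described above, the following minimal sketch counts pairwise SNP differences from a small matrix of variant sites. It is not taken from any of the pipelines the abstract refers to; the function name `pairwise_snp_distances`, the isolate names, and the toy bases are all hypothetical, and skipping sites with `N` or `-` calls is an assumed convention.

```python
from itertools import combinations


def pairwise_snp_distances(snp_matrix):
    """Count differing sites between every pair of genomes.

    snp_matrix maps a sample name to its string of bases at the variant
    sites (all strings the same length). Sites where either sample has a
    missing or ambiguous call ('N' or '-') are skipped (an assumption).
    """
    lengths = {len(seq) for seq in snp_matrix.values()}
    if len(lengths) > 1:
        raise ValueError("all rows must cover the same variant sites")

    distances = {}
    for a, b in combinations(sorted(snp_matrix), 2):
        diff = 0
        for x, y in zip(snp_matrix[a].upper(), snp_matrix[b].upper()):
            if x in "N-" or y in "N-":
                continue  # no usable call at this site for this pair
            if x != y:
                diff += 1
        distances[(a, b)] = diff
    return distances


# Hypothetical toy matrix: three isolates, six variant sites.
toy = {
    "isolate_A": "ACGTAC",
    "isolate_B": "ACGTAT",
    "isolate_C": "TCGNGT",
}

for pair, d in pairwise_snp_distances(toy).items():
    print(pair, d)
```

A distance table like this is one common input to distance-based tree building; character-based methods would instead work from the matrix itself.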

METHODS

We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and "known" phylogenetic trees in publicly accessible databases and developed a standard spreadsheet format for describing each dataset. To make these benchmarks easy to download, we developed an automated script that reads this spreadsheet format.
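To make the idea of a descriptive spreadsheet driving a download script more concrete, here is a minimal sketch of reading such a table. The layout is an assumption made for illustration only (key/value metadata rows, a blank line, then one row per sample), and every field name and value below (`sample_id`, `run_accession`, `in_outbreak`, the placeholder accessions) is hypothetical. The actual format is defined in the authors' GitHub repository.

```python
import csv


def parse_dataset_table(text):
    """Split a two-part tab-separated table into (metadata, samples).

    Assumed layout (hypothetical): key<TAB>value rows describing the
    dataset, a blank line, then a header row and one row per sample.
    """
    lines = text.splitlines()
    try:
        split_at = next(i for i, line in enumerate(lines) if not line.strip())
    except StopIteration:
        raise ValueError("expected a blank line between metadata and samples")

    # Part 1: dataset-level key/value pairs.
    metadata = {}
    for row in csv.reader(lines[:split_at], delimiter="\t"):
        if len(row) >= 2:
            metadata[row[0].strip()] = row[1].strip()

    # Part 2: per-sample rows keyed by the header line.
    sample_lines = [line for line in lines[split_at + 1:] if line.strip()]
    samples = list(csv.DictReader(sample_lines, delimiter="\t"))
    return metadata, samples


# Hypothetical example table; none of these values are real accessions.
EXAMPLE = (
    "Organism\tExample organism\n"
    "Tree\texample_tree.nwk\n"
    "\n"
    "sample_id\trun_accession\tin_outbreak\n"
    "sample_1\tRUN_PLACEHOLDER_1\tyes\n"
    "sample_2\tRUN_PLACEHOLDER_2\tno\n"
)

metadata, samples = parse_dataset_table(EXAMPLE)
outbreak_runs = [s["run_accession"] for s in samples if s["in_outbreak"] == "yes"]
print(metadata["Organism"], outbreak_runs)
```

A downloader built on this would loop over the sample rows and fetch each accession from the relevant public database; that step is left out here because it depends on tools and identifiers the abstract does not specify.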

RESULTS

Our "outbreak" benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the "known tree" can be accurately called the "true tree". The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets.

DISCUSSION

These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools. We welcome additional benchmark datasets in our recommended format and, if relevant, will add them to our GitHub site. Together, these datasets, the dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines.
