Benchmark datasets for SARS-CoV-2 surveillance bioinformatics.

Affiliations

Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.

Broad Institute of MIT and Harvard, Cambridge, MA, United States of America.

Publication information

PeerJ. 2022 Sep 5;10:e13821. doi: 10.7717/peerj.13821. eCollection 2022.

DOI:10.7717/peerj.13821
PMID:36093336
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9454940/
Abstract

BACKGROUND

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of coronavirus disease 2019 (COVID-19), has spread globally and is being surveilled with an international genome sequencing effort. Surveillance consists of sample acquisition, library preparation, and whole genome sequencing. This has necessitated a classification scheme detailing Variants of Concern (VOC) and Variants of Interest (VOI), and the rapid expansion of bioinformatics tools for sequence analysis. These bioinformatic tools are means for major actionable results: maintaining quality assurance and checks, defining population structure, performing genomic epidemiology, and inferring lineage to allow reliable and actionable identification and classification. Additionally, the pandemic has required public health laboratories to reach high throughput proficiency in sequencing library preparation and downstream data analysis rapidly. However, both processes can be limited by a lack of a standardized sequence dataset.

METHODS

We identified six SARS-CoV-2 sequence datasets from recent publications, public databases and internal resources. In addition, we created a method to mine public databases to identify representative genomes for these datasets. Using this novel method, we identified several genomes as either VOI/VOC representatives or non-VOI/VOC representatives. To describe each dataset, we utilized a previously published datasets format, which describes accession information and whole dataset information. Additionally, a script from the same publication has been enhanced to download and verify all data from this study.
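The download-and-verify step mentioned in the Methods can be illustrated with a minimal sketch. It is not the script from the study: the two-column manifest (`filename`, `sha256`) and the function names `sha256_of` and `verify_manifest` are assumptions made for illustration, and the actual datasets format may use different columns and checksum fields.

```python
import csv
import hashlib
import io
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_text: str, data_dir: Path) -> dict[str, str]:
    """Check each file listed in a tab-separated manifest.

    The manifest layout (columns 'filename' and 'sha256') is hypothetical.
    Returns {filename: 'ok' | 'mismatch' | 'missing'}.
    """
    results: dict[str, str] = {}
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    for row in reader:
        target = data_dir / row["filename"]
        if not target.is_file():
            results[row["filename"]] = "missing"
        elif sha256_of(target) == row["sha256"].strip().lower():
            results[row["filename"]] = "ok"
        else:
            results[row["filename"]] = "mismatch"
    return results
```

Recording a checksum next to each accession lets a laboratory confirm that its local copy of a benchmark dataset is byte-identical to the published one before comparing pipeline outputs.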

RESULTS

The benchmark datasets focus on the two most widely used sequencing platforms: long read sequencing data from the Oxford Nanopore Technologies platform and short read sequencing data from the Illumina platform. There are six datasets: three were derived from recent publications; two were derived from data mining public databases to answer common questions not covered by published datasets; one unique dataset representing common sequence failures was obtained by rigorously scrutinizing data that did not pass quality checks. The dataset summary table, data mining script and quality control (QC) values for all sequence data are publicly available on GitHub: https://github.com/CDCgov/datasets-sars-cov-2.

DISCUSSION

The datasets presented here were generated to help public health laboratories build sequencing and bioinformatics capacity, benchmark different workflows and pipelines, and calibrate QC thresholds to ensure sequencing quality. Together, improvements in these areas support accurate and timely outbreak investigation and surveillance, providing actionable data for pandemic management. Furthermore, these publicly available and standardized benchmark data will facilitate the development and adjudication of new pipelines.
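As a rough sketch of how a benchmark dataset might be used to compare a pipeline against expected results, the snippet below applies one QC threshold and then measures lineage concordance. Everything in it is an assumption for illustration: the `SampleResult` fields, the `summarize` function, and the default threshold of 0.10 for the fraction of ambiguous bases are placeholders, not values or methods taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleResult:
    """One pipeline result for a benchmark sample (hypothetical fields)."""
    sample_id: str
    expected_lineage: str   # lineage recorded in the benchmark dataset
    called_lineage: str     # lineage reported by the pipeline under test
    fraction_n: float       # fraction of ambiguous (N) bases in the consensus


def summarize(results: list[SampleResult], max_fraction_n: float = 0.10) -> dict:
    """Count samples passing QC and lineage concordance among those samples.

    max_fraction_n is a placeholder threshold; each laboratory would
    calibrate its own value.
    """
    passing = [r for r in results if r.fraction_n <= max_fraction_n]
    concordant = [r for r in passing if r.called_lineage == r.expected_lineage]
    return {
        "n_total": len(results),
        "n_pass_qc": len(passing),
        "n_concordant": len(concordant),
        "concordance": len(concordant) / len(passing) if passing else 0.0,
    }


if __name__ == "__main__":
    demo = [
        SampleResult("s1", "B.1.1.7", "B.1.1.7", 0.01),
        SampleResult("s2", "B.1.617.2", "B.1.1.7", 0.02),
        SampleResult("s3", "B.1.1.7", "B.1.1.7", 0.40),
    ]
    print(summarize(demo))
```

Re-running the summary with different values of `max_fraction_n` shows how a threshold choice trades the number of samples retained against concordance, which is one way to approach the QC calibration described above.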

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/60ad/9454940/804897f0ee06/peerj-10-13821-g001.jpg
