A Schema for Digitized Surface Swab Site Metadata in Open-Source DNA Sequence Databases.

Affiliations

Department of Food Science, Cornell University, Ithaca, New York, USA.

Centre for Infectious Disease Genomics and One Health, Faculty of Health Sciences, Simon Fraser University, Burnaby, British Columbia, Canada.

Publication information

mSystems. 2023 Apr 27;8(2):e0128422. doi: 10.1128/msystems.01284-22. Epub 2023 Feb 27.

DOI:10.1128/msystems.01284-22

PMID:36847566

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10134794/

Abstract

Large, open-source DNA sequence databases have been generated, in part, through the collection of microbial pathogens by swabbing surfaces in built environments. Analyzing these data in aggregate through public health surveillance requires digitization of the complex, domain-specific metadata that are associated with the swab site locations. However, the swab site location information is currently collected in a single, free-text "isolation source" field, which promotes the generation of poorly detailed descriptions with varying word order, granularity, and linguistic errors, making automation difficult and reducing machine-actionability. We assessed 1,498 free-text swab site descriptions that were generated during routine foodborne pathogen surveillance. The lexicon of free-text metadata was evaluated to determine the informational facets and the quantity of unique terms used by data collectors. Open Biological Ontologies (OBO) Foundry libraries were used to develop hierarchical vocabularies that are connected with logical relationships to describe swab site locations. Five informational facets that were described by 338 unique terms were identified via content analysis. Term hierarchy facets were developed, as were statements (called axioms) about how the entities within these five domains are related. The schema developed through this study has been integrated into a publicly available pathogen metadata standard, facilitating ongoing surveillance and investigations. The One Health Enteric Package has been available at NCBI BioSample since 2022. The collective use of metadata standards increases the interoperability of DNA sequence databases and enables large-scale approaches to data sharing and artificial intelligence as well as big-data solutions to food safety.

The regular analysis of whole-genome sequence data in collections such as NCBI's Pathogen Detection Database is used by many public health organizations to detect outbreaks of infectious disease. However, isolate metadata in these databases are often incomplete and of poor quality. These complex, raw metadata must often be reorganized and manually formatted for use in aggregate analyses. These processes are inefficient and time-consuming, increasing the interpretative labor needed by public health groups to extract actionable information. The future use of open genomic epidemiology networks will be supported through the development of an internationally applicable vocabulary system with which swab site locations can be described.
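The word-order problem described in the abstract can be illustrated with a minimal sketch. This is not the authors' content-analysis method: the example descriptions are invented, and the helper names (`tokenize`, `unique_terms`, `order_free_key`) are hypothetical. It only shows how the same swab site, written three ways in a free-text field, looks like three distinct values until the entries are reduced to their terms.

```python
import re
from collections import Counter

# Hypothetical free-text "isolation source" entries (invented for illustration,
# not taken from the 1,498 descriptions assessed in the study).
descriptions = [
    "floor drain, raw side",
    "Raw side floor drain",
    "drain (floor) - raw side",
    "conveyor belt surface",
]

def tokenize(text):
    """Lowercase a description and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())

def unique_terms(entries):
    """Count, for each term, how many entries it appears in."""
    counts = Counter()
    for entry in entries:
        counts.update(set(tokenize(entry)))
    return counts

def order_free_key(text):
    """Build an order-insensitive key from the sorted unique terms."""
    return tuple(sorted(set(tokenize(text))))

term_counts = unique_terms(descriptions)
distinct_strings = len(set(descriptions))
distinct_keys = len({order_free_key(d) for d in descriptions})

print(f"{distinct_strings} distinct strings, {distinct_keys} distinct term sets")
print(f"{len(term_counts)} unique terms: {sorted(term_counts)}")
```

A term-set key of this kind only removes word-order variation. Differences in granularity and spelling would still need a controlled vocabulary, which is the gap that the hierarchical schema in the abstract addresses.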

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/368c/10134794/19dc5f5c7ec6/msystems.01284-22-f001.jpg

Similar articles

Interpretative Labor and the Bane of Nonstandardized Metadata in Public Health Surveillance and Food Safety.

Clin Infect Dis. 2021 Oct 20;73(8):1537-1539. doi: 10.1093/cid/ciab615.

Standardized metadata for human pathogen/vector genomic sequences.

PLoS One. 2014 Jun 17;9(6):e99979. doi: 10.1371/journal.pone.0099979. eCollection 2014.

METAGENOTE: a simplified web platform for metadata annotation of genomic samples and streamlined submission to NCBI's sequence read archive.

BMC Bioinformatics. 2020 Sep 3;21(1):378. doi: 10.1186/s12859-020-03694-0.

GeMInA, Genomic Metadata for Infectious Agents, a geospatial surveillance pathogen database.

Nucleic Acids Res. 2010 Jan;38(Database issue):D754-64. doi: 10.1093/nar/gkp832. Epub 2009 Oct 22.

Pathogen metadata platform: software for accessing and analyzing pathogen strain information.

BMC Bioinformatics. 2016 Sep 15;17(1):379. doi: 10.1186/s12859-016-1231-2.

Future-proofing and maximizing the utility of metadata: The PHA4GE SARS-CoV-2 contextual data specification package.

Gigascience. 2022 Feb 16;11. doi: 10.1093/gigascience/giac003.

Comparative Analysis and Data Provenance for 1,113 Bacterial Genome Assemblies.

mSphere. 2022 Jun 29;7(3):e0007722. doi: 10.1128/msphere.00077-22. Epub 2022 May 2.

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

The CAIRR Pipeline for Submitting Standards-Compliant B and T Cell Receptor Repertoire Sequencing Studies to the National Center for Biotechnology Information Repositories.

Front Immunol. 2018 Aug 16;9:1877. doi: 10.3389/fimmu.2018.01877. eCollection 2018.

Cited by

A simulation model to quantify the efficacy of dry cleaning interventions on a contaminated milk powder line.

Appl Environ Microbiol. 2025 May 21;91(5):e0208624. doi: 10.1128/aem.02086-24. Epub 2025 Apr 17.

The choice of 16S rRNA gene sequence analysis impacted characterization of highly variable surface microbiota in dairy processing environments.

mSystems. 2024 Nov 19;9(11):e0062024. doi: 10.1128/msystems.00620-24. Epub 2024 Oct 21.
