Department of Food Science, Cornell University, Ithaca, New York, USA.
Centre for Infectious Disease Genomics and One Health, Faculty of Health Sciences, Simon Fraser University, Burnaby, British Columbia, Canada.
mSystems. 2023 Apr 27;8(2):e0128422. doi: 10.1128/msystems.01284-22. Epub 2023 Feb 27.
Large, open-source DNA sequence databases have been generated, in part, through the collection of microbial pathogens by swabbing surfaces in built environments. Analyzing these data in aggregate for public health surveillance requires digitization of the complex, domain-specific metadata associated with the swab site locations. However, swab site location information is currently collected in a single, free-text "isolation source" field, which promotes poorly detailed descriptions with varying word order, granularity, and linguistic errors. This makes automation difficult and reduces machine-actionability. We assessed 1,498 free-text swab site descriptions that were generated during routine foodborne pathogen surveillance. The lexicon of the free-text metadata was evaluated to determine the informational facets and the number of unique terms used by data collectors. Open Biological Ontologies (OBO) Foundry libraries were used to develop hierarchical vocabularies, connected by logical relationships, to describe swab site locations. Five informational facets, described by 338 unique terms, were identified via content analysis. Term hierarchy facets were developed, as were statements (called axioms) about how the entities within these five domains are related. The schema developed through this study has been integrated into a publicly available pathogen metadata standard, facilitating ongoing surveillance and investigations. The One Health Enteric Package has been available at NCBI BioSample since 2022. The collective use of metadata standards increases the interoperability of DNA sequence databases and enables large-scale approaches to data sharing and artificial intelligence, as well as big-data solutions to food safety.
Many public health organizations regularly analyze whole-genome sequence data in collections such as NCBI's Pathogen Detection Database to detect outbreaks of infectious disease. However, isolate metadata in these databases are often incomplete and of poor quality. These complex, raw metadata must often be reorganized and manually formatted before they can be used in aggregate analyses. These processes are inefficient and time-consuming, and they increase the interpretative labor that public health groups need to extract actionable information. Developing an internationally applicable vocabulary system for describing swab site locations will support the future use of open genomic epidemiology networks.