Dugan Vivien G, Emrich Scott J, Giraldo-Calderón Gloria I, Harb Omar S, Newman Ruchi M, Pickett Brett E, Schriml Lynn M, Stockwell Timothy B, Stoeckert Christian J, Sullivan Dan E, Singh Indresh, Ward Doyle V, Yao Alison, Zheng Jie, Barrett Tanya, Birren Bruce, Brinkac Lauren, Bruno Vincent M, Caler Elizabet, Chapman Sinéad, Collins Frank H, Cuomo Christina A, Di Francesco Valentina, Durkin Scott, Eppinger Mark, Feldgarden Michael, Fraser Claire, Fricke W Florian, Giovanni Maria, Henn Matthew R, Hine Erin, Hotopp Julie Dunning, Karsch-Mizrachi Ilene, Kissinger Jessica C, Lee Eun Mi, Mathur Punam, Mongodin Emmanuel F, Murphy Cheryl I, Myers Garry, Neafsey Daniel E, Nelson Karen E, Nierman William C, Puzak Julia, Rasko David, Roos David S, Sadzewicz Lisa, Silva Joana C, Sobral Bruno, Squires R Burke, Stevens Rick L, Tallon Luke, Tettelin Herve, Wentworth David, White Owen, Will Rebecca, Wortman Jennifer, Zhang Yun, Scheuermann Richard H
J. Craig Venter Institute, Rockville, Maryland, and La Jolla, California, United States of America; National Institute of Allergy and Infectious Diseases, Rockville, Maryland, United States of America.
University of Notre Dame, Notre Dame, Indiana, United States of America.
PLoS One. 2014 Jun 17;9(6):e99979. doi: 10.1371/journal.pone.0099979. eCollection 2014.
High-throughput sequencing has accelerated the determination of genome sequences for thousands of human infectious disease pathogens and dozens of their vectors. The scale and scope of these data are enabling genotype-phenotype association studies to identify genetic determinants of pathogen virulence and drug/insecticide resistance, and phylogenetic studies to track the origin and spread of disease outbreaks. To maximize the utility of genomic sequences for these purposes, it is essential that metadata about the pathogen/vector isolate characteristics be collected and made available in organized, clear, and consistent formats. Here we report the GSCID/BRC Project and Sample Application Standard, developed by representatives of the Genome Sequencing Centers for Infectious Diseases (GSCIDs), the Bioinformatics Resource Centers (BRCs) for Infectious Diseases, and the U.S. National Institute of Allergy and Infectious Diseases (NIAID), part of the National Institutes of Health (NIH), and informed by interactions with numerous collaborating scientists. The standard maps its fields to terms from other data standards initiatives, including the Genomic Standards Consortium's minimal information (MIxS) and NCBI's BioSample/BioProjects checklists and the Ontology for Biomedical Investigations (OBI). It includes data fields about characteristics of the organism or environmental source of the specimen, spatial-temporal information about the specimen isolation event, phenotypic characteristics of the pathogen/vector isolated, and project leadership and support. By modeling metadata fields into an ontology-based semantic framework and reusing existing ontologies and minimum information checklists, the application standard can be extended to support additional project-specific data fields and integrated with other data represented with comparable standards.
The use of this metadata standard by all ongoing and future GSCID sequencing projects will provide a consistent representation of these data in the BRC resources and other repositories that leverage these data, allowing investigators to identify relevant genomic sequences and perform comparative genomics analyses that are both statistically meaningful and biologically relevant.