Department of Marine Biology and Oceanography, Institut de Ciències del Mar-CSIC, Barcelona, Catalonia, Spain.
Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia, Canada.
PLoS Biol. 2018 Sep 17;16(9):e2005849. doi: 10.1371/journal.pbio.2005849. eCollection 2018 Sep.
Environmental sequencing has greatly expanded our knowledge of micro-eukaryotic diversity and ecology by revealing previously unknown lineages and their distribution. However, the value of these data is critically dependent on the quality of the reference databases used to assign an identity to environmental sequences. Existing databases contain errors and struggle to keep pace with rapidly changing eukaryotic taxonomy, the influx of novel diversity, and computational challenges related to assembling the high-quality alignments and trees needed for accurate characterization of lineage diversity. EukRef (eukref.org) is an ongoing community-driven initiative that addresses these challenges by bringing together taxonomists, whose expertise spans the eukaryotic tree of life, and microbial ecologists, who use environmental sequence data, to develop reliable reference databases across the diversity of microbial eukaryotes. EukRef organizes and facilitates rigorous mining and annotation of sequence data by providing protocols, guidelines, and tools. The EukRef pipeline and tools allow users interested in a particular group of microbial eukaryotes to retrieve all sequences belonging to that group from the International Nucleotide Sequence Database Collaboration (INSDC) (GenBank, the European Nucleotide Archive [ENA], or the DNA DataBank of Japan [DDBJ]), to place those sequences in a phylogenetic tree, and to curate taxonomic and environmental information for the group. We provide guidelines to facilitate the process and to standardize taxonomic annotations. The final outputs of this process are (1) a reference tree and alignment, (2) a reference sequence database, including taxonomic and environmental information, and (3) a list of putative chimeras and other artifactual sequences. These products will be useful for the broad community as they become publicly available (at eukref.org) and are shared with existing reference databases.
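As an illustration of outputs (2) and (3) described in the abstract, the following is a minimal Python sketch of how curated records might be separated into a reference sequence database and a list of putative artifacts. It is not the EukRef pipeline or its tools: the `CuratedRecord` class, its field names, the `split_outputs` and `taxonomy_string` helpers, the semicolon separator, and the accession and taxon values are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CuratedRecord:
    """One curated sequence record (hypothetical structure, not EukRef's own)."""
    accession: str              # placeholder accession, not a real INSDC entry
    taxonomy: Tuple[str, ...]   # curated ranks, broadest first
    environment: str            # curated environmental label
    artifact: bool = False      # True if flagged as a putative chimera/artifact


def split_outputs(records: List[CuratedRecord]) -> Tuple[List[CuratedRecord], List[str]]:
    """Separate clean records (reference database) from flagged accessions."""
    reference = [r for r in records if not r.artifact]
    artifacts = [r.accession for r in records if r.artifact]
    return reference, artifacts


def taxonomy_string(record: CuratedRecord, sep: str = ";") -> str:
    """Join curated ranks into one annotation string (separator is an assumption)."""
    return sep.join(record.taxonomy)


# Made-up example data for illustration only.
records = [
    CuratedRecord("ACC0001", ("Eukaryota", "GroupA", "GenusX"), "marine"),
    CuratedRecord("ACC0002", ("Eukaryota", "GroupA"), "soil", artifact=True),
    CuratedRecord("ACC0003", ("Eukaryota", "GroupA", "GenusY"), "freshwater"),
]

reference, artifacts = split_outputs(records)
for rec in reference:
    print(rec.accession, taxonomy_string(rec), rec.environment, sep="\t")
print("putative artifacts:", artifacts)
```

Keeping the flagged accessions in a separate list, rather than discarding them, mirrors the abstract's point that the list of putative chimeras and other artifactual sequences is itself a product of the curation process.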