Groussman R D, Blaskowski S, Coesel S N, Armbrust E V
School of Oceanography, University of Washington, Benjamin Hall IRB, Room 306, 616 NE Northlake Place, Seattle, WA, 98105, USA.
Molecular Engineering and Sciences Institute, University of Washington, Molecular Engineering & Sciences Building, 3946 W Stevens Way NE, Seattle, WA, 98195, USA.
Sci Data. 2023 Dec 21;10(1):926. doi: 10.1038/s41597-023-02842-4.
Metatranscriptomics generates large volumes of sequence data about transcribed genes in natural environments. Taxonomic annotation of these datasets depends on the availability of curated reference sequences. For marine microbial eukaryotes, current reference libraries are limited by gaps in sequenced organism diversity and by barriers to updating libraries with new sequence data, so that only about half of eukaryotic environmental transcripts receive a taxonomic annotation. Here, we introduce Marine Functional EukaRyotic Reference Taxa (MarFERReT), a marine microbial eukaryotic sequence library designed for taxonomic annotation of eukaryotic metatranscriptomes. We gathered 902 publicly accessible marine eukaryote genomes and transcriptomes, assessed their sequence quality and cross-contamination issues, and selected 800 validated entries for inclusion in MarFERReT. Version 1.1 of MarFERReT contains reference sequences from these 800 marine eukaryotic genomes and transcriptomes, covering 453 species- and strain-level taxa and totaling nearly 28 million protein sequences with associated NCBI and PR2 Taxonomy identifiers and Pfam functional annotations. The MarFERReT project repository hosts containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT.