Biodiversity Research Centre, Academia Sinica, Taipei 11529, Taiwan.
Smithsonian Tropical Research Institute, Panama City, Republic of Panama.
Sci Data. 2017 Mar 14;4:170027. doi: 10.1038/sdata.2017.27.
Mitochondrial-encoded genes are increasingly targeted in studies that use high-throughput sequencing to characterize metazoan communities from environmental samples (e.g., plankton, meiofauna, filtered water). Yet, unlike for nuclear ribosomal RNA markers, no high-quality reference dataset has so far been available for taxonomic assignment of these genes. Here, we retrieved all metazoan mitochondrial gene sequences from GenBank, quality filtered them, and formatted the resulting datasets for use with taxonomic assignment tools. The reference datasets, called 'Midori references', are available for download at www.reference-midori.info. Two versions are provided: (I) Midori-UNIQUE, which contains all unique haplotypes associated with each species, and (II) Midori-LONGEST, which contains a single sequence, the longest, for each species. Overall, the mitochondrial Cytochrome oxidase subunit I gene was the most sequence-rich gene. However, in some phyla, sequences of the mitochondrial large ribosomal subunit RNA and Cytochrome b apoenzyme genes were available for a large number of species. The Midori references are formatted to be compatible with several taxonomic assignment programs, so automated taxonomic assignment of high-throughput sequences can be particularly effective with these datasets.
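The difference between the two versions can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy species labels and the toy sequences are all hypothetical, and it assumes that "unique haplotype" means an exact, case-insensitive match of the full sequence string within a species, which the abstract does not specify.

```python
from collections import defaultdict


def build_unique(records):
    """UNIQUE-style set: keep each distinct sequence once per species.

    Assumption: two sequences are the same haplotype only if their
    uppercased strings are identical.
    """
    seen = defaultdict(set)
    unique = []
    for species, seq in records:
        seq = seq.upper()
        if seq not in seen[species]:
            seen[species].add(seq)
            unique.append((species, seq))
    return unique


def build_longest(records):
    """LONGEST-style set: keep one sequence per species, the longest.

    Ties are resolved in favour of the sequence seen first.
    """
    longest = {}
    for species, seq in records:
        seq = seq.upper()
        if species not in longest or len(seq) > len(longest[species]):
            longest[species] = seq
    return longest


# Toy input: (species, sequence) pairs, invented for illustration only.
records = [
    ("Species A", "ACGTACGT"),
    ("Species A", "acgtacgt"),    # same haplotype, different case
    ("Species A", "ACGTACGTAA"),
    ("Species B", "TTGCA"),
]

unique = build_unique(records)
longest = build_longest(records)
```

With this toy input, the UNIQUE-style list keeps two sequences for Species A and one for Species B, while the LONGEST-style dictionary keeps exactly one sequence per species.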