National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.
Genome Biol. 2021 Sep 20;22(1):270. doi: 10.1186/s13059-021-02490-0.
Sequence Read Archive submissions to the National Center for Biotechnology Information often lack useful metadata, which limits the utility of these submissions. We describe the Sequence Taxonomic Analysis Tool (STAT), a scalable k-mer-based tool for fast assessment of the taxonomic diversity intrinsic to submissions, independent of metadata. We show that our MinHash-based k-mer tool is accurate and scalable, offering reliable criteria for efficient selection of data for further analysis by the scientific community. It both validates submissions and augments sample metadata with reliable, searchable taxonomic terms.
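To make the MinHash-based k-mer idea in the abstract concrete, the following is a minimal Python sketch of the general technique, not STAT's actual implementation. It builds bottom-s MinHash sketches from k-mer sets, estimates Jaccard similarity between a read and each reference, and reports the best-scoring taxon. The function names, the values of `k` and `size`, the use of truncated SHA-1 as the hash, and the toy sequences and taxon labels are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (first 8 bytes of SHA-1)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(seq, k=8, size=16):
    """Bottom-s MinHash sketch: the `size` smallest distinct k-mer hashes."""
    return sorted(kmer_hash(km) for km in kmers(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size=16):
    """Estimate Jaccard similarity of two k-mer sets from their sketches."""
    # The smallest `size` hashes of the union act as a random sample of it.
    union = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not union:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in union if h in shared) / len(union)


def classify(read, references, k=8, size=16):
    """Score a read against each reference; return (best_taxon, all_scores)."""
    read_sketch = minhash_sketch(read, k, size)
    scores = {
        name: jaccard_estimate(read_sketch, minhash_sketch(ref, k, size), size)
        for name, ref in references.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy references with hypothetical taxon labels (not real genomes).
references = {
    "taxon_a": "ACGTACGTTAGCCGATAGCTAGGCTTAACGGATCCGATTACG",
    "taxon_b": "TTGACCATGGCAATCGTTCAGGTACCAGTTGACATGCAAGTC",
}

best, scores = classify(references["taxon_a"], references)
print(best, scores)
```

This sketch ignores reverse complements, sequencing errors, and any taxonomic hierarchy, and it compares whole sequences rather than querying a prebuilt k-mer database, so it should be read only as an illustration of why a small sample of hashed k-mers can stand in for a full sequence comparison.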