Department of Health and Human Services, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
Nat Methods. 2024 Jun;21(6):994-1002. doi: 10.1038/s41592-024-02280-z. Epub 2024 May 16.
Searching the vast and rapidly growing nucleotide content in resources, such as runs in the Sequence Read Archive and assemblies for whole-genome shotgun sequencing projects in GenBank, is currently impractical for most researchers. Here we present Pebblescout, a tool that navigates such content by providing indexing and search capabilities. Indexing uses dense sampling of the sequences in the resource. Search finds subjects (runs or assemblies) that have short sequence matches to a user query, with well-defined guarantees, and ranks them by the informativeness of the matches. We illustrate the functionality of Pebblescout by creating eight databases that index over 3.7 petabases. The web service of Pebblescout can be reached at https://pebblescout.ncbi.nlm.nih.gov. We show that, for a wide range of query lengths, Pebblescout provides a data-driven way of finding relevant subsets of large nucleotide resources, substantially reducing the effort for downstream analysis. We also show that Pebblescout results compare favorably to MetaGraph and Sourmash.
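The index-then-search idea described above can be illustrated with a toy sketch. This is not Pebblescout's implementation: the abstract does not specify the sampling scheme or the scoring function, so the window-minimum sampling, the rarity weight, and every name and parameter below (`K`, `W`, `sampled_kmers`, `build_index`, `search`) are assumptions chosen only to make the idea concrete.

```python
import math
from collections import defaultdict

K = 5  # toy k-mer length (hypothetical value, chosen for readability)
W = 3  # number of consecutive k-mers per sampling window (hypothetical)


def sampled_kmers(seq, k=K, w=W):
    """Keep the lexicographically smallest k-mer of every window of w k-mers.

    Assumed sampling scheme: two sequences sharing an exact substring of
    length >= w + k - 1 share at least one sampled k-mer.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) <= w:
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def build_index(subjects):
    """Map each sampled k-mer to the set of subject ids that contain it."""
    index = defaultdict(set)
    for subject_id, seq in subjects.items():
        for kmer in sampled_kmers(seq):
            index[kmer].add(subject_id)
    return index


def search(index, n_subjects, query):
    """Rank subjects by a summed rarity weight over matching sampled k-mers.

    The weight is an assumed stand-in for "informativeness": a k-mer found
    in few subjects contributes more than one found in many.
    """
    scores = defaultdict(float)
    for kmer in sampled_kmers(query):
        hits = index.get(kmer, ())
        for subject_id in hits:
            scores[subject_id] += math.log(1 + n_subjects / len(hits))
    # Highest score first; ties are broken by subject id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up subjects standing in for runs or assemblies.
subjects = {
    "run_A": "ACGTACGTTGCAAGCT",
    "run_B": "TTTTTTTTTTTTTTTT",
    "run_C": "GGGACGTACGTTGCCC",
}
index = build_index(subjects)
results = search(index, len(subjects), "ACGTACGTTGC")
```

In this sketch only a subset of k-mers is stored per subject, which keeps the index small, while the windowing rule still ensures that a sufficiently long exact match between query and subject is detected. Subjects that share no sampled k-mer with the query never appear in `results`.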