Pierce N Tessa, Irber Luiz, Reiter Taylor, Brooks Phillip, Brown C Titus
Department of Population Health and Reproduction, University of California, Davis, Davis, California, 95616, USA.
F1000Res. 2019 Jul 4;8:1006. doi: 10.12688/f1000research.19675.1. eCollection 2019.
The sourmash software package uses MinHash-based sketching to create "signatures": compressed representations of DNA, RNA, and protein sequences that can be stored, searched, explored, and taxonomically annotated. These signatures can be used to estimate sequence similarity between very large data sets quickly and with little memory, and to search large databases of genomes for matches to query genomes and metagenomes. sourmash is implemented in C++, Rust, and Python, and is freely available under the BSD license at http://github.com/dib-lab/sourmash.
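The idea behind MinHash-based signatures can be illustrated with a minimal sketch in plain Python. This is not sourmash's implementation or API: the function names (`kmers`, `hash_kmer`, `sketch`, `jaccard_estimate`), the use of SHA-1 as the hash function, and the default values `k=21` and `num=500` are all assumptions chosen for illustration, and the sketch ignores reverse complements. It keeps only the smallest `num` k-mer hashes of each sequence (a "bottom-k" sketch) and estimates Jaccard similarity from those small sets instead of from the full k-mer sets.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (SHA-1 is an illustrative choice)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(seq, k=21, num=500):
    """Bottom-k sketch: the num smallest distinct k-mer hashes, sorted."""
    hashes = sorted({hash_kmer(km) for km in kmers(seq, k)})
    return hashes[:num]


def jaccard_estimate(a, b, num=500):
    """Estimate Jaccard similarity of two sequences from their sketches.

    Take the num smallest hashes of the combined sketches, then count
    the fraction of those that occur in both sketches.
    """
    union = sorted(set(a) | set(b))[:num]
    if not union:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in union if h in shared) / len(union)
```

For example, `jaccard_estimate(sketch(x), sketch(y))` compares two sequences `x` and `y` using at most `num` integers per sequence, which is why comparisons of this kind stay fast and memory-light even when the underlying data sets are very large.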