Laboratory of Genetics and Genomics, National Institute on Aging, Intramural Research Program, National Institutes of Health, Baltimore, MD, 21224, USA.
BMC Bioinformatics. 2021 Oct 2;22(1):474. doi: 10.1186/s12859-021-04390-3.
The Sequence Alignment/Map Format Specification (SAM) is one of the most widely adopted file formats in bioinformatics, and many researchers use it daily. Several tools, including most high-throughput sequencing read aligners, use it as their primary output, and many more tools have been developed to process it. However, despite its flexibility, SAM-encoded files can often be difficult to query and understand, even for experienced bioinformaticians. As genomic data grow rapidly, structured and efficient queries on data encoded in SAM/BAM files are becoming increasingly important. Existing tools are either very limited in their query capabilities or inefficient. Critically, new tools that address these shortcomings should not only support existing large datasets but should also do so without requiring massive data transformations and file infrastructure reorganizations.
Here we introduce SamQL, an SQL-like query language for the SAM format with intuitive syntax that supports complex and efficient queries on top of SAM/BAM files and that can replace the Bash one-liners commonly employed by many bioinformaticians. SamQL has high expressive power with no upper limit on query size and, when parallelized, outperforms other, substantially less expressive software.
SamQL is a complete query language that we envision as a step toward a structured database engine for genomics. SamQL is written in Go and is freely available as a standalone program and as an open-source library under an MIT license at https://github.com/maragkakislab/samql/ .
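To make the idea of a structured query over SAM records concrete, the following is a minimal sketch in Go of composing field-level conditions into a single filter. It is not SamQL's syntax, parser, or library API. All names here (`Record`, `Filter`, `And`, `Not`, `RNameIs`, `MinMapQ`, `Unmapped`, `filterSAM`) are hypothetical, and the sample alignments are made up for illustration. The sketch assumes plain-text SAM input with the eleven mandatory tab-separated fields and combines three conditions: reference name equals `chr1`, mapping quality at least 30, and the read is not flagged as unmapped.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// Record holds the subset of mandatory SAM fields used in this sketch.
type Record struct {
	QName string
	Flag  int
	RName string
	Pos   int
	MapQ  int
}

// parseRecord splits one tab-delimited SAM alignment line into a Record.
func parseRecord(line string) (Record, error) {
	f := strings.Split(line, "\t")
	if len(f) < 11 {
		return Record{}, fmt.Errorf("expected at least 11 fields, got %d", len(f))
	}
	flag, err := strconv.Atoi(f[1])
	if err != nil {
		return Record{}, fmt.Errorf("bad FLAG %q: %w", f[1], err)
	}
	pos, err := strconv.Atoi(f[3])
	if err != nil {
		return Record{}, fmt.Errorf("bad POS %q: %w", f[3], err)
	}
	mapq, err := strconv.Atoi(f[4])
	if err != nil {
		return Record{}, fmt.Errorf("bad MAPQ %q: %w", f[4], err)
	}
	return Record{QName: f[0], Flag: flag, RName: f[2], Pos: pos, MapQ: mapq}, nil
}

// Filter is a predicate over one record (hypothetical type for this sketch).
type Filter func(Record) bool

// And returns a filter that passes only if every given filter passes.
func And(fs ...Filter) Filter {
	return func(r Record) bool {
		for _, f := range fs {
			if !f(r) {
				return false
			}
		}
		return true
	}
}

// Not inverts a filter.
func Not(f Filter) Filter { return func(r Record) bool { return !f(r) } }

// RNameIs matches records aligned to the given reference sequence.
func RNameIs(name string) Filter { return func(r Record) bool { return r.RName == name } }

// MinMapQ matches records whose mapping quality is at least q.
func MinMapQ(q int) Filter { return func(r Record) bool { return r.MapQ >= q } }

// Unmapped reports whether FLAG bit 0x4 (segment unmapped) is set.
func Unmapped(r Record) bool { return r.Flag&0x4 != 0 }

// filterSAM scans SAM text, skips header lines, and returns matching records.
func filterSAM(input string, keep Filter) ([]Record, error) {
	var out []Record
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		line := sc.Text()
		if line == "" || strings.HasPrefix(line, "@") {
			continue // header or blank line
		}
		rec, err := parseRecord(line)
		if err != nil {
			return nil, err
		}
		if keep(rec) {
			out = append(out, rec)
		}
	}
	return out, sc.Err()
}

// sample is made-up SAM text used only to exercise the sketch.
const sample = "@HD\tVN:1.6\n" +
	"r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n" +
	"r2\t16\tchr2\t200\t60\t4M\t*\t0\t0\tACGT\tIIII\n" +
	"r3\t0\tchr1\t300\t10\t4M\t*\t0\t0\tACGT\tIIII\n" +
	"r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"

func main() {
	query := And(RNameIs("chr1"), MinMapQ(30), Not(Unmapped))
	recs, err := filterSAM(sample, query)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range recs {
		fmt.Println(r.QName, r.RName, r.Pos, r.MapQ)
	}
}
```

Expressing each condition as a small composable predicate is one way to avoid a hard-coded set of filters. A query language would additionally parse such conditions from text, which this sketch does not attempt.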