Institute of Computer Science, Warsaw University of Technology, Warsaw, Poland.
Department of Computer Science, Poznan University of Technology, Poznań, Poland.
Bioinformatics. 2019 Jun 1;35(12):2156-2158. doi: 10.1093/bioinformatics/bty940.
Efficient processing of large-scale genomic datasets has recently become possible due to the application of 'big data' technologies in bioinformatics pipelines. We present SeQuiLa, a distributed, ANSI SQL-compliant solution for fast querying and processing of genomic intervals that is available as an Apache Spark package. The proposed range join strategy is significantly (∼22×) faster than the default Apache Spark implementation and outperforms other state-of-the-art tools for genomic interval processing.
The project is available at http://biodatageeks.org/sequila/.
Supplementary data are available at Bioinformatics online.
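For readers unfamiliar with the term, a range join pairs records from two tables whenever their intervals overlap, rather than when a key matches exactly. The following Python sketch illustrates the operation on toy data; it is not SeQuiLa's implementation, whose strategy is not detailed in this abstract. The function names, the `(chromosome, start, end)` tuple layout, the closed-coordinate convention, and the sample intervals are all assumptions made for illustration.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# Hypothetical record layout: (chromosome, start, end), both ends inclusive.
Interval = Tuple[str, int, int]


def naive_overlap_join(left: List[Interval], right: List[Interval]):
    """Compare every pair of intervals: O(len(left) * len(right))."""
    return [
        (a, b)
        for a in left
        for b in right
        if a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]
    ]


def indexed_overlap_join(left: List[Interval], right: List[Interval]):
    """Group `right` by chromosome and sort by start, so that intervals on
    other chromosomes, or starting after the query ends, are never examined.
    This prunes on one side only; an interval tree would prune on both."""
    grouped: Dict[str, List[Interval]] = {}
    for iv in right:
        grouped.setdefault(iv[0], []).append(iv)

    index = {}
    for chrom, ivs in grouped.items():
        ivs.sort(key=lambda iv: iv[1])
        index[chrom] = (ivs, [iv[1] for iv in ivs])

    pairs = []
    for a in left:
        if a[0] not in index:
            continue
        ivs, starts = index[a[0]]
        # Only intervals whose start is <= a's end can overlap a.
        upper = bisect_right(starts, a[2])
        for b in ivs[:upper]:
            if b[2] >= a[1]:
                pairs.append((a, b))
    return pairs


# Toy data, invented for this example.
reads = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
targets = [("chr1", 180, 220), ("chr1", 400, 500), ("chr2", 10, 60), ("chr3", 1, 5)]

naive_pairs = naive_overlap_join(reads, targets)
indexed_pairs = indexed_overlap_join(reads, targets)
```

Both functions return the same pairs; the indexed version simply avoids examining candidates that cannot overlap, which is the general idea behind specialised range join strategies.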