Department of Pathology and Laboratory Medicine, Penn Neurodegeneration Genomics Center.
Genomics and Computational Biology Graduate Group.
Bioinformatics. 2020 Jun 1;36(12):3879-3881. doi: 10.1093/bioinformatics/btaa246.
We report Spark-based INFERence of the molecular mechanisms of NOn-coding genetic variants (SparkINFERNO), a scalable bioinformatics pipeline for characterizing non-coding findings from genome-wide association studies (GWASs). SparkINFERNO prioritizes causal variants underlying GWAS association signals and reports the relevant regulatory elements, tissue contexts and plausible target genes they affect. To achieve this, the SparkINFERNO algorithm integrates GWAS summary statistics with a large-scale collection of functional genomics datasets spanning enhancer activity, transcription factor binding, expression quantitative trait loci and other functional data across more than 400 tissues and cell types. Scalability is achieved by an underlying API implemented using Apache Spark and Giggle-based genomic indexing. We evaluated SparkINFERNO on large GWASs and show that it is more than 60 times more efficient and scales with data size and the amount of computational resources.
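The integration step described above rests on overlapping variant positions with functional annotation intervals. The following is a minimal single-machine sketch of that idea in plain Python, not SparkINFERNO's actual API or its Spark/Giggle implementation. All names (`build_index`, `overlapping_labels`, `annotate`), the variant identifiers, the coordinates and the annotation labels are hypothetical and chosen only for illustration. Intervals are assumed to be half-open, `[start, end)`.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(annotations):
    """Group (chrom, start, end, label) intervals by chromosome, sorted by start."""
    index = defaultdict(list)
    for chrom, start, end, label in annotations:
        index[chrom].append((start, end, label))
    for chrom in index:
        index[chrom].sort()
    return index


def overlapping_labels(index, chrom, pos):
    """Return labels of half-open intervals [start, end) that contain pos."""
    intervals = index.get(chrom, [])
    # Only intervals whose start is <= pos can contain pos; binary search
    # finds that prefix, which is then scanned linearly (a sketch, not an
    # optimized interval index such as Giggle).
    hi = bisect_right(intervals, (pos, float("inf"), ""))
    return [label for start, end, label in intervals[:hi] if pos < end]


def annotate(variants, annotations):
    """Map each variant id to the annotation labels overlapping its position."""
    index = build_index(annotations)
    return {vid: overlapping_labels(index, chrom, pos)
            for vid, chrom, pos in variants}


# Hypothetical toy data: (variant id, chromosome, position)
variants = [
    ("var_hypothetical_1", "chr1", 150),
    ("var_hypothetical_2", "chr1", 900),
    ("var_hypothetical_3", "chr2", 50),
]
# Hypothetical annotations: (chromosome, start, end, label)
annotations = [
    ("chr1", 100, 200, "enhancer:tissueA"),
    ("chr1", 120, 180, "TFBS:tissueB"),
    ("chr2", 10, 40, "enhancer:tissueA"),
]

result = annotate(variants, annotations)
for vid, labels in result.items():
    print(vid, labels)
```

In this toy input only the first variant falls inside any annotation interval. A distributed version would partition variants and annotation tracks across workers, which is where a framework such as Apache Spark and a dedicated genomic index become relevant.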
SparkINFERNO runs on clusters or on a single server with an Apache Spark environment, and is available at https://bitbucket.org/wanglab-upenn/SparkINFERNO or https://hub.docker.com/r/wanglab/spark-inferno.
lswang@pennmedicine.upenn.edu.
Supplementary data are available at Bioinformatics online.