Bioinformatics applications on Apache Spark.

Affiliations

College of Computer, National University of Defense Technology, No.109, Deya Road, Kaifu District, Changsha, 410073, China.

Institute of Computing Technology, Chinese Academy of Sciences, No.6, South Road of the Academy of Sciences, Haidian District, Beijing, 100190, China.

Publication information

Gigascience. 2018 Aug 1;7(8):giy098. doi: 10.1093/gigascience/giy098.

DOI:10.1093/gigascience/giy098

PMID:30101283

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6113509/

Abstract

With the rapid development of next-generation sequencing technology, ever-increasing quantities of genomic data pose a tremendous challenge to data processing. Therefore, there is an urgent need for highly scalable and powerful computational systems. Among the state-of-the-art parallel computing platforms, Apache Spark is a fast, general-purpose, in-memory, iterative computing framework for large-scale data processing that ensures high fault tolerance and high scalability by introducing the resilient distributed dataset abstraction. In terms of performance, Spark can be up to 100 times faster in terms of memory access and 10 times faster in terms of disk access than Hadoop. Moreover, it provides advanced application programming interfaces in Java, Scala, Python, and R. It also supports some advanced components, including Spark SQL for structured data processing, MLlib for machine learning, GraphX for computing graphs, and Spark Streaming for stream computing. We surveyed Spark-based applications used in next-generation sequencing and other biological domains, such as epigenetics, phylogeny, and drug discovery. The results of this survey are used to provide a comprehensive guideline allowing bioinformatics researchers to apply Spark in their own fields.
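As a rough illustration of the data-parallel style the abstract attributes to Spark, the sketch below counts k-mers in sequencing reads with a partition, map, and reduce-by-key pattern. It is a single-process, pure-Python stand-in and not Spark itself: the function names, the toy reads, and the partitioning scheme are all assumptions made for illustration. In PySpark the same steps would typically be written with `sc.parallelize`, `flatMap`, and `reduceByKey` on a resilient distributed dataset.

```python
from collections import Counter
from functools import reduce


def kmers(read, k):
    """Yield every length-k substring (k-mer) of one sequencing read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_partition(reads, k):
    """Map step: count k-mers inside a single partition of reads."""
    return Counter(km for read in reads for km in kmers(read, k))


def count_kmers(reads, k, num_partitions=2):
    """Split reads into partitions, count each one, then merge the counts."""
    # Round-robin split; a hypothetical stand-in for Spark's partitioning.
    partitions = [reads[i::num_partitions] for i in range(num_partitions)]
    partial = [count_partition(p, k) for p in partitions]
    # Reduce step: merge the per-partition counters key by key.
    return reduce(lambda a, b: a + b, partial, Counter())


reads = ["ACGTAC", "GTACGT", "ACGT"]  # made-up reads, for illustration only
counts = count_kmers(reads, k=3)
print(dict(counts))
```

Because each partition is counted independently and the merge is associative, the result does not depend on how the reads are split, which is the property that lets a framework such as Spark distribute the map step across a cluster.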

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b42/6113509/23a194e3cf17/giy098fig1.jpg

Similar articles

A distributed computing model for big data anonymization in the networks.

PLoS One. 2023 Apr 28;18(4):e0285212. doi: 10.1371/journal.pone.0285212. eCollection 2023.

SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision.

Bioinformatics. 2014 Sep 15;30(18):2652-3. doi: 10.1093/bioinformatics/btu343. Epub 2014 May 19.

VC@Scale: Scalable and high-performance variant calling on cluster environments.

Gigascience. 2021 Sep 7;10(9). doi: 10.1093/gigascience/giab057.

SparkGA2: Production-quality memory-efficient Apache Spark based genome analysis framework.

PLoS One. 2019 Dec 5;14(12):e0224784. doi: 10.1371/journal.pone.0224784. eCollection 2019.

Rapid protein alignment in the cloud: HAMOND combines fast DIAMOND alignments with Hadoop parallelism.

J Biotechnol. 2017 Sep 10;257:58-60. doi: 10.1016/j.jbiotec.2017.02.020. Epub 2017 Feb 21.

CloudDOE: a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce.

PLoS One. 2014 Jun 4;9(6):e98146. doi: 10.1371/journal.pone.0098146. eCollection 2014.

GenAp: a distributed SQL interface for genomic data.

BMC Bioinformatics. 2016 Feb 4;17:63. doi: 10.1186/s12859-016-0904-1.

Big Data in metagenomics: Apache Spark vs MPI.

PLoS One. 2020 Oct 6;15(10):e0239741. doi: 10.1371/journal.pone.0239741. eCollection 2020.

SparkBLAST: scalable BLAST processing using in-memory operations.

BMC Bioinformatics. 2017 Jun 27;18(1):318. doi: 10.1186/s12859-017-1723-8.

Cited by

Applicability Assessment of Technologies for Predictive and Prescriptive Analytics of Nephrology Big Data.

Proteomics. 2025 Jun;25(11-12):e202400135. doi: 10.1002/pmic.202400135. Epub 2025 May 27.

Mechanisms and technologies in cancer epigenetics.

Front Oncol. 2025 Jan 7;14:1513654. doi: 10.3389/fonc.2024.1513654. eCollection 2024.

Biomedical Big Data Technologies, Applications, and Challenges for Precision Medicine: A Review.

Glob Chall. 2023 Nov 20;8(1):2300163. doi: 10.1002/gch2.202300163. eCollection 2024 Jan.

Negation recognition in clinical natural language processing using a combination of the NegEx algorithm and a convolutional neural network.

BMC Med Inform Decis Mak. 2023 Oct 13;23(1):216. doi: 10.1186/s12911-023-02301-5.

Ten quick tips for bioinformatics analyses using an Apache Spark distributed computing environment.

PLoS Comput Biol. 2023 Jul 20;19(7):e1011272. doi: 10.1371/journal.pcbi.1011272. eCollection 2023 Jul.

Fog-Based Smart Cardiovascular Disease Prediction System Powered by Modified Gated Recurrent Unit.

Diagnostics (Basel). 2023 Jun 15;13(12):2071. doi: 10.3390/diagnostics13122071.

A distributed computing model for big data anonymization in the networks.

PLoS One. 2023 Apr 28;18(4):e0285212. doi: 10.1371/journal.pone.0285212. eCollection 2023.

Framing Apache Spark in life sciences.

Heliyon. 2023 Feb 9;9(2):e13368. doi: 10.1016/j.heliyon.2023.e13368. eCollection 2023 Feb.

Volatile Organic Compounds (VOCs) Protect from Fish Pathogen sp.: A Combined In Vitro, In Vivo, and In Silico Approach.

Microorganisms. 2023 Jan 10;11(1):172. doi: 10.3390/microorganisms11010172.

Cloud-native distributed genomic pileup operations.

Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btac804.
