Framing Apache Spark in life sciences.

Authors

Manconi Andrea, Gnocchi Matteo, Milanesi Luciano, Marullo Osvaldo, Armano Giuliano

Affiliations

Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy.

Department of Mathematics and Computer science - University of Cagliari, Cagliari, Italy.

Publication information

Heliyon. 2023 Feb 9;9(2):e13368. doi: 10.1016/j.heliyon.2023.e13368. eCollection 2023 Feb.

DOI:10.1016/j.heliyon.2023.e13368

PMID:36852030

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9958288/

Abstract

Advances in high-throughput and digital technologies have made big data approaches necessary for handling complex tasks in the life sciences. However, the shift to big data has confronted researchers with technical and infrastructural challenges in storing, sharing, and analysing such data. Tasks of this kind require distributed computing systems and algorithms that can ensure efficient processing. Cutting-edge distributed programming frameworks make it possible to implement flexible algorithms that adapt the computation to the data, whether on on-premise HPC clusters or on cloud architectures. In this context, Apache Spark is a powerful HPC engine for large-scale data processing on clusters. Thanks to specialised libraries, it supports work with structured and relational data as well as machine learning, graph-based computation, and stream processing. This review aims to help life sciences researchers understand the features of Apache Spark and assess whether it can be used successfully in their research.

Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7d3e/9958288/5359b177dde6/gr001.jpg

Similar articles

Big Data in metagenomics: Apache Spark vs MPI.

PLoS One. 2020 Oct 6;15(10):e0239741. doi: 10.1371/journal.pone.0239741. eCollection 2020.

A distributed computing model for big data anonymization in the networks.

PLoS One. 2023 Apr 28;18(4):e0285212. doi: 10.1371/journal.pone.0285212. eCollection 2023.

Big Data Approaches for the Analysis of Large-Scale fMRI Data Using Apache Spark and GPU Processing: A Demonstration on Resting-State fMRI Data from the Human Connectome Project.

Front Neurosci. 2016 Jan 6;9:492. doi: 10.3389/fnins.2015.00492. eCollection 2015.

MaRe: Processing Big Data with application containers on Apache Spark.

Gigascience. 2020 May 1;9(5). doi: 10.1093/gigascience/giaa042.

VC@Scale: Scalable and high-performance variant calling on cluster environments.

Gigascience. 2021 Sep 7;10(9). doi: 10.1093/gigascience/giab057.

Using Apache Spark on genome assembly for scalable overlap-graph reduction.

Hum Genomics. 2019 Oct 22;13(Suppl 1):48. doi: 10.1186/s40246-019-0227-1.

Bioinformatics applications on Apache Spark.

Gigascience. 2018 Aug 1;7(8):giy098. doi: 10.1093/gigascience/giy098.

An Optimized IoT-enabled Big Data Analytics Architecture for Edge-Cloud Computing.

IEEE Internet Things J. 2023 Mar;10(5):3995-4005. doi: 10.1109/jiot.2022.3157552. Epub 2022 Mar 14.

Large-scale virtual screening on public cloud resources with Apache Spark.

J Cheminform. 2017 Mar 6;9:15. doi: 10.1186/s13321-017-0204-4. eCollection 2017.

Cited by

Gut microbiota and tuberculosis.

Imeta. 2025 Jun 22;4(4):e70054. doi: 10.1002/imt2.70054. eCollection 2025 Aug.

Mechanisms and technologies in cancer epigenetics.

Front Oncol. 2025 Jan 7;14:1513654. doi: 10.3389/fonc.2024.1513654. eCollection 2024.
