Framing Apache Spark in life sciences.

Authors

Manconi Andrea, Gnocchi Matteo, Milanesi Luciano, Marullo Osvaldo, Armano Giuliano

Affiliations

Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy.

Department of Mathematics and Computer science - University of Cagliari, Cagliari, Italy.

Publication information

Heliyon. 2023 Feb 9;9(2):e13368. doi: 10.1016/j.heliyon.2023.e13368. eCollection 2023 Feb.

Abstract

Advances in high-throughput and digital technologies have required the adoption of big data for handling complex tasks in life sciences. However, the shift to big data has confronted researchers with technical and infrastructural challenges in storing, sharing, and analysing such data. Tasks of this kind require distributed computing systems and algorithms able to ensure efficient processing. Cutting-edge distributed programming frameworks make it possible to implement flexible algorithms that adapt the computation to the data on on-premise HPC clusters or cloud architectures. In this context, Apache Spark is a very powerful HPC engine for large-scale data processing on clusters. Thanks also to specialised libraries for working with structured and relational data, it supports machine learning, graph-based computation, and stream processing. This review article aims to help life sciences researchers ascertain the features of Apache Spark and assess whether it can be successfully used in their research activities.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7d3e/9958288/5359b177dde6/gr001.jpg