Performance Evaluation Analysis of Spark Streaming Backpressure for Data-Intensive Pipelines.

Affiliations

Institute of Informatics, Federal University of Rio Grande do Sul, UFRGS/PPGC, Porto Alegre 91501-970, RS, Brazil.

LIG-ERODS, Université Grenoble Alpes, 38058 Grenoble, France.

Publication information

Sensors (Basel). 2022 Jun 23;22(13):4756. doi: 10.3390/s22134756.

DOI:10.3390/s22134756
PMID:35808249
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9269592/
Abstract

A significant rise in the adoption of streaming applications has changed the decision-making processes in the last decade. This movement has led to the emergence of several Big Data technologies for in-memory processing, such as the systems Apache Storm, Spark, Heron, Samza, Flink, and others. Spark Streaming, a widespread open-source implementation, processes data-intensive applications that often require large amounts of memory. However, Spark Unified Memory Manager cannot properly manage sudden or intensive data surges and their related in-memory caching needs, resulting in performance and throughput degradation, high latency, a large number of garbage collection operations, out-of-memory issues, and data loss. This work presents a comprehensive performance evaluation of Spark Streaming backpressure to investigate the hypothesis that it could support data-intensive pipelines under specific pressure requirements. The results reveal that backpressure is suitable only for small and medium pipelines for stateless and stateful applications. Furthermore, it points out the Spark Streaming limitations that lead to in-memory-based issues for data-intensive pipelines and stateful applications. In addition, the work indicates potential solutions.
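
For readers unfamiliar with the mechanism evaluated here, the following is a minimal sketch of how backpressure is typically switched on for a Spark Streaming (DStream) job. The property names are Spark's documented configuration keys. The numeric values and the helper `to_submit_args` are illustrative assumptions, not settings reported in the paper.

```python
# Spark Streaming (DStream) configuration keys related to backpressure.
# The keys are real Spark properties; the values are assumed for illustration.
BACKPRESSURE_CONF = {
    # Let Spark adapt the ingestion rate to the observed batch processing rate.
    "spark.streaming.backpressure.enabled": "true",
    # Rate used for the first batch, before any feedback is available.
    "spark.streaming.backpressure.initialRate": "1000",
    # Hard upper bound (records/second) per receiver.
    "spark.streaming.receiver.maxRate": "5000",
    # Hard upper bound (records/second) per Kafka partition for the direct API.
    "spark.streaming.kafka.maxRatePerPartition": "2000",
}


def to_submit_args(conf):
    """Render a config mapping as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the flags so they can be pasted into a spark-submit command line.
    print(" ".join(to_submit_args(BACKPRESSURE_CONF)))
```

The two `maxRate` settings act as static ceilings, while backpressure adjusts the rate dynamically below them. Whether such limits are enough under heavy load is the question the paper evaluates.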
