Institute of Informatics, Federal University of Rio Grande do Sul, UFRGS/PPGC, Porto Alegre 91501-970, RS, Brazil.
LIG-ERODS, Université Grenoble Alpes, 38058 Grenoble, France.
Sensors (Basel). 2022 Jun 23;22(13):4756. doi: 10.3390/s22134756.
The adoption of streaming applications has risen significantly over the last decade and has changed decision-making processes. This movement has led to the emergence of several Big Data technologies for in-memory processing, such as Apache Storm, Spark, Heron, Samza, and Flink. Spark Streaming, a widely used open-source implementation, processes data-intensive applications that often require large amounts of memory. However, Spark's Unified Memory Manager cannot properly manage sudden or intensive data surges and their related in-memory caching needs, resulting in degraded performance and throughput, high latency, a large number of garbage collection operations, out-of-memory issues, and data loss. This work presents a comprehensive performance evaluation of Spark Streaming backpressure to investigate the hypothesis that it could support data-intensive pipelines under specific pressure requirements. The results reveal that backpressure is suitable only for small and medium pipelines, for both stateless and stateful applications. Furthermore, the evaluation points out the Spark Streaming limitations that lead to in-memory-based issues for data-intensive pipelines and stateful applications. In addition, the work indicates potential solutions.