School of Cyber Security and Computer, Hebei University, Baoding 071000, China.
Sensors (Basel). 2022 Aug 8;22(15):5930. doi: 10.3390/s22155930.
Apache Spark is a popular open-source distributed data processing framework that can efficiently process massive amounts of data. It exposes more than 180 configuration parameters, whose values users typically choose by hand based on their own experience. However, because the parameters are numerous and inherently correlated, manual tuning is tedious. To replace experience-based tuning, we designed and implemented a reinforcement-learning-based Spark configuration parameter optimizer. First, we trained a deep neural network model to predict Spark application performance, and verified its accuracy and effectiveness from multiple perspectives. Second, to improve the efficiency of searching for better configuration parameters, we improved the Q-learning algorithm so that start and end states are set automatically in each training iteration, which effectively mitigates the agent's poor performance in exploring better configurations. Lastly, with the default configuration as the baseline, experimental results show that the optimized configuration achieved average performance improvements of 47%, 43%, 31%, and 45% for four different types of Spark applications, indicating that the optimizer can efficiently find better configuration parameters and improve the performance of a variety of Spark applications.
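To make the search idea concrete, the following is a minimal sketch of plain tabular Q-learning over a tiny configuration space. It is not the improved algorithm described above. The candidate values, the `predict_runtime` function (a hand-written stand-in for a trained performance prediction model), the random start state, and the fixed episode length are all assumptions made for illustration.

```python
import random

# Illustrative search space: real Spark parameter names, made-up candidate values.
SPACE = {
    "spark.executor.cores": [1, 2, 4, 8],
    "spark.executor.memory": [1, 2, 4, 8],  # in GB, kept numeric for simplicity
    "spark.sql.shuffle.partitions": [50, 100, 200, 400],
}
NAMES = list(SPACE)
# Each action moves one parameter one step down or up its candidate list.
ACTIONS = [(i, d) for i in range(len(NAMES)) for d in (-1, 1)]


def predict_runtime(state):
    """Hypothetical stand-in for a trained performance model (lower is better)."""
    cores, mem, parts = (SPACE[n][i] for n, i in zip(NAMES, state))
    return 100.0 / cores + 40.0 / mem + abs(parts - 200) / 10.0 + 2.0 * cores


def step(state, action):
    """Apply an action, clamping the index to the bounds of the candidate list."""
    i, d = action
    nxt = list(state)
    nxt[i] = min(max(nxt[i] + d, 0), len(SPACE[NAMES[i]]) - 1)
    return tuple(nxt)


def q_learning(episodes=300, steps=20, alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = {}  # (state, action) -> estimated value
    best = (0, 0, 0)  # assumed baseline: first candidate of every parameter
    for _ in range(episodes):
        # Assumption: random start state and a fixed number of steps per episode.
        state = tuple(rng.randrange(len(SPACE[n])) for n in NAMES)
        for _ in range(steps):
            if rng.random() < eps:  # explore
                action = rng.choice(ACTIONS)
            else:  # exploit the current estimates
                action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
            nxt = step(state, action)
            # Reward is the predicted runtime saved by taking this step.
            reward = predict_runtime(state) - predict_runtime(nxt)
            target = reward + gamma * max(q.get((nxt, a), 0.0) for a in ACTIONS)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (target - old)
            state = nxt
            if predict_runtime(state) < predict_runtime(best):
                best = state
    return {n: SPACE[n][i] for n, i in zip(NAMES, best)}, predict_runtime(best)


if __name__ == "__main__":
    config, runtime = q_learning()
    print(config, runtime)
```

Rewarding the agent with the predicted runtime saved per step means the search never has to launch a real Spark job. The quality of the result therefore depends entirely on how accurate the prediction model is.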