Improving the cost efficiency of large-scale cloud systems running hybrid workloads - A case study of Alibaba cluster traces.

Authors

Everman Brad, Rajendran Narmadha, Li Xiaomin, Zong Ziliang

Affiliation

Texas State University, San Marcos, TX, United States.

Publication

Sustain Comput. 2021 Jun;30:100528. doi: 10.1016/j.suscom.2021.100528. Epub 2021 Mar 3.

Abstract

The coronavirus pandemic has dramatically disrupted the retail industry, as many stores have been forced to close and people across the world are sheltering in place, with online shopping as the inevitable choice. To meet the rapidly increasing demand for e-commerce, more data centers are expected to provide new cloud services, or significantly improve existing ones, that can better support hybrid workloads (e.g., online purchase jobs and batch jobs that support ranking or recommendation systems). Successful cloud systems need to efficiently handle and quickly respond to huge volumes of traffic with such hybrid workloads. Meanwhile, it is critical to reduce the total cost of ownership (TCO) for profitability. Improving system utilization is one of the effective techniques for achieving the twin goals of high performance and low TCO. This paper conducts a comprehensive analysis of the 2017 and 2018 cluster traces released by Alibaba, which provide a case study of Alibaba's best practices in improving the performance and cost efficiency of its large-scale cloud systems by consolidating time-sensitive online service jobs with time-insensitive batch jobs. Our investigation indicates that over-subscription (causing resource waste and low utilization) and under-subscription (causing performance degradation) problems co-exist in the current Alibaba system. We develop a simulator that allows us to evaluate possible solutions to these problems and their impact on performance, energy consumption, and TCO. Our experiments show that the estimated TCO can be reduced by $600,000 for the 2018 trace running on over 4,000 machines without compromising performance. The TCO can decrease by nearly $68 million if a similar strategy is extrapolated to Alibaba's 432,000 web-facing servers.
