Huck Institutes of the Life Sciences, Neuroscience Program, The Pennsylvania State University, University Park, USA.
Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, USA.
Bioinformatics. 2019 Sep 15;35(18):3453-3460. doi: 10.1093/bioinformatics/btz054.
One of the many technical challenges that arise when scheduling bioinformatics analyses at scale is determining the appropriate amount of memory and processing resources. Both over- and under-allocation lead to inefficient use of computational infrastructure: over-allocation locks resources that could otherwise be used for other analyses, while under-allocation causes job failure and requires analyses to be repeated with a larger memory or runtime allowance. We address this challenge by using a historical dataset of bioinformatics analyses run on the Galaxy platform to demonstrate the feasibility of an online service for resource requirement estimation.
Here we introduce the Galaxy job run dataset and test popular machine learning models on the task of resource usage prediction. We evaluate three popular forest models: the extra trees regressor, the gradient boosting regressor and the random forest regressor, and find that random forests perform best on the runtime prediction task. We also present two methods of choosing walltimes for previously unseen jobs, illustrated in the sketch below. Quantile regression forests are the more accurate of the two and allow performance to be tuned by adjusting the confidence level of the estimates; however, the sizes of their prediction intervals are variable and cannot be strictly bounded. Random forest classifiers address this problem by providing direct control over the size of the prediction intervals, with accuracy comparable to that of the regressor. We show that the memory requirements of a job can be estimated with the same methods, which, to our knowledge, has not been done before. Such estimation can be highly beneficial for accurate resource allocation.
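The following minimal sketch (not the authors' implementation; the scikit-learn usage, feature names and synthetic data are illustrative assumptions) shows the two walltime-selection strategies described above: approximating a quantile regression forest by taking an upper quantile over the per-tree predictions of a random forest, and bounding the prediction interval by construction with a classifier over fixed walltime bins.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical job features: input size (MB), a tool parameter, CPU count.
X = rng.random((2000, 3)) * np.array([1000.0, 20.0, 16.0])
runtime = 0.5 * X[:, 0] + 10.0 * X[:, 1] + rng.normal(0.0, 20.0, 2000)
runtime = np.maximum(runtime, 1.0)  # runtimes in seconds, floored at 1 s

X_train, X_test, y_train, y_test = train_test_split(X, runtime, random_state=0)

# Strategy 1: quantile estimates from a random forest. Taking an upper
# quantile over the per-tree predictions approximates a quantile
# regression forest; raising the quantile lowers the under-allocation
# (job failure) rate at the cost of more over-allocation.
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)
per_tree = np.stack([tree.predict(X_test) for tree in reg.estimators_])
walltime = np.quantile(per_tree, 0.90, axis=0)
print("under-allocated fraction:", np.mean(walltime < y_test))

# Strategy 2: a classifier over fixed walltime bins, which bounds the
# size of each prediction interval by construction.
edges = np.array([60.0, 300.0, 900.0, 3600.0])  # bin edges in seconds
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, np.digitize(y_train, edges))
pred_bin = clf.predict(X_test)
print("bin accuracy:", np.mean(pred_bin == np.digitize(y_test, edges)))
```

In the first strategy the interval width varies per job, while in the second the bin edges fix the candidate walltimes up front; the same pattern applies to memory estimation with peak memory as the target.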
Source code, implemented in Python, is available at https://github.com/atyryshkina/algorithm-performance-analysis.
Supplementary data are available at Bioinformatics online.