Predicting runtimes of bioinformatics tools based on historical data: five years of Galaxy usage.

Affiliations

Huck Institute of Life Sciences, Neuroscience Program, The Pennsylvania State University, University Park, USA.

Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, USA.

Publication information

Bioinformatics. 2019 Sep 15;35(18):3453-3460. doi: 10.1093/bioinformatics/btz054.

DOI:10.1093/bioinformatics/btz054
PMID:30698642
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6931352/
Abstract

MOTIVATION

One of the many technical challenges that arise when scheduling bioinformatics analyses at scale is determining the appropriate amount of memory and processing resources. Both over- and under-allocation lead to an inefficient use of computational infrastructure. Over-allocation locks resources that could otherwise be used for other analyses. Under-allocation causes job failure and requires analyses to be repeated with a larger memory or runtime allowance. We address this challenge by using a historical dataset of bioinformatics analyses run on the Galaxy platform to demonstrate the feasibility of an online service for resource requirement estimation.

RESULTS

Here we introduce the Galaxy job run dataset and test popular machine learning models on the task of resource usage prediction. We include three popular forest models (the extra trees regressor, the gradient boosting regressor and the random forest regressor) and find that random forests perform best in the runtime prediction task. We also present two methods of choosing walltimes for previously unseen jobs. Quantile regression forests are more accurate in their predictions and grant the ability to improve performance by changing the confidence of the estimates. However, the sizes of the confidence intervals are variable and cannot be absolutely constrained. Random forest classifiers address this problem by providing control over the size of the prediction intervals, with an accuracy comparable to that of the regressor. We show that estimating the memory requirements of a job is possible using the same methods, which, as far as we know, has not been done before. Such estimation can be highly beneficial for accurate resource allocation.
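The idea of choosing a walltime at a chosen confidence level can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn and NumPy, uses synthetic job records with hypothetical features (`input_size_mb`, `n_threads`), and approximates a quantile regression forest by taking a quantile over the per-tree predictions of an ordinary random forest regressor. A true quantile regression forest uses the training responses stored in the leaves instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def make_synthetic_jobs(n_jobs=2000, seed=0):
    """Hypothetical job records: runtime grows with input size, with multiplicative noise."""
    rng = np.random.default_rng(seed)
    input_size_mb = rng.uniform(1, 1000, size=n_jobs)
    n_threads = rng.integers(1, 9, size=n_jobs)  # 1 to 8 threads
    runtime_s = 0.5 * input_size_mb / n_threads * rng.lognormal(0.0, 0.3, size=n_jobs)
    features = np.column_stack([input_size_mb, n_threads])
    return features, runtime_s


def walltime_from_forest(forest, features, quantile):
    """Use the given quantile of the per-tree predictions as the walltime (an approximation)."""
    per_tree = np.stack([tree.predict(features) for tree in forest.estimators_])
    return np.quantile(per_tree, quantile, axis=0)


def coverage(walltime, actual_runtime):
    """Fraction of jobs that would have finished within the chosen walltime."""
    return float(np.mean(actual_runtime <= walltime))


X, y = make_synthetic_jobs()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# A higher quantile trades a longer walltime for fewer jobs killed at the limit.
for q in (0.5, 0.9, 0.99):
    walltime = walltime_from_forest(forest, X_test, q)
    print(f"quantile={q:.2f}  coverage={coverage(walltime, y_test):.3f}  "
          f"mean walltime={walltime.mean():.1f}s")
```

Raising the quantile can only lengthen each job's walltime, so coverage never decreases. How wide the resulting intervals are depends on the data, which matches the abstract's remark that their size cannot be absolutely constrained.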

AVAILABILITY AND IMPLEMENTATION

Source code available at https://github.com/atyryshkina/algorithm-performance-analysis, implemented in Python.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

1
Predicting runtimes of bioinformatics tools based on historical data: five years of Galaxy usage.
Bioinformatics. 2019 Sep 15;35(18):3453-3460. doi: 10.1093/bioinformatics/btz054.
2
Sequence database versioning for command line and Galaxy bioinformatics servers.
Bioinformatics. 2016 Apr 15;32(8):1275-7. doi: 10.1093/bioinformatics/btv724. Epub 2015 Dec 12.
3
Towards reliable named entity recognition in the biomedical domain.
Bioinformatics. 2020 Jan 1;36(1):280-286. doi: 10.1093/bioinformatics/btz504.
4
TAIJI: approaching experimental replicates-level accuracy for drug synergy prediction.
Bioinformatics. 2019 Jul 1;35(13):2338-2339. doi: 10.1093/bioinformatics/bty955.
5
Fast bootstrapping-based estimation of confidence intervals of expression levels and differential expression from RNA-Seq data.
Bioinformatics. 2017 Oct 15;33(20):3302-3304. doi: 10.1093/bioinformatics/btx365.
6
A new approach for interpreting Random Forest models and its application to the biology of ageing.
Bioinformatics. 2018 Jul 15;34(14):2449-2456. doi: 10.1093/bioinformatics/bty087.
7
Mycorrhiza: genotype assignment using phylogenetic networks.
Bioinformatics. 2020 Jan 1;36(1):212-220. doi: 10.1093/bioinformatics/btz476.
8
GalaxyCloudRunner: enhancing scalable computing for Galaxy.
Bioinformatics. 2021 Jul 19;37(12):1763-1765. doi: 10.1093/bioinformatics/btaa860.
9
Differential privacy-based evaporative cooling feature selection and classification with relief-F and random forests.
Bioinformatics. 2017 Sep 15;33(18):2906-2913. doi: 10.1093/bioinformatics/btx298.
10
The revival of the Gini importance?
Bioinformatics. 2018 Nov 1;34(21):3711-3718. doi: 10.1093/bioinformatics/bty373.

Cited by

1
Container Profiler: Profiling resource utilization of containerized big data pipelines.
Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad069. Epub 2023 Aug 25.
2
AMPRO-HPCC: A Machine-Learning Tool for Predicting Resources on Slurm HPC Clusters.
ADVCOMP Int Conf Adv Eng Comput Appl Sci. 2021 Oct;2021:20-27.
3
A machine learning technique for identifying DNA enhancer regions utilizing CIS-regulatory element patterns.
Sci Rep. 2022 Sep 7;12(1):15183. doi: 10.1038/s41598-022-19099-3.
4
DNAPred_Prot: Identification of DNA-Binding Proteins Using Composition- and Position-Based Features.
Appl Bionics Biomech. 2022 Apr 13;2022:5483115. doi: 10.1155/2022/5483115. eCollection 2022.
5
Ensemble Prediction of Job Resources to Improve System Performance for Slurm-Based HPC Systems.
Pract Exp Adv Res Comput (2021). 2021 Jul;2021. doi: 10.1145/3437359.3465574. Epub 2021 Jul 17.
6
GalaxyCloudRunner: enhancing scalable computing for Galaxy.
Bioinformatics. 2021 Jul 19;37(12):1763-1765. doi: 10.1093/bioinformatics/btaa860.
7
Accumulating computational resource usage of genomic data analysis workflow to optimize cloud computing instance selection.
Gigascience. 2019 Apr 1;8(4). doi: 10.1093/gigascience/giz052.
