Luthra Ishika, Priyadarshi Satyam, Guo Rui, Mahieu Lukas, Kempynck Niklas, Dooley Damion, Penzar Dmitry, Vorontsov Ilya, Sheng Yilun, Tu Xinming, Klie Adam, Drusinsky Shiron, Floren Alexander, Armand Ethan, Alasoo Kaur, Seelig Georg, Tewhey Ryan, Koo Peter, Agarwal Vikram, Gosai Sager, Pinello Luca, White Michael A, Lal Avantika, Zeitlinger Julia, Pollard Katherine S, Libbrecht Maxwell, Carter Hannah, Mostafavi Sara, Kulakovskiy Ivan, Hsiao Will, Aerts Stein, Zhou Jian, de Boer Carl G
School of Biomedical Engineering, University of British Columbia, Vancouver, BC, Canada.
Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX, USA.
bioRxiv. 2025 Jul 8:2025.07.04.663250. doi: 10.1101/2025.07.04.663250.
The rapid expansion of genomics datasets and the application of machine learning have produced sequence-to-activity genomics models with ever-expanding capabilities. However, benchmarking these models on practical applications has been challenging because individual projects evaluate their models in ad hoc ways, and there is substantial heterogeneity in both model architectures and benchmarking tasks. To address this challenge, we have created GAME, a system for large-scale, community-led, standardized model benchmarking on user-defined evaluation tasks. We borrow concepts from the Application Programming Interface (API) paradigm to allow seamless communication between pre-trained models and benchmarking tasks, ensuring consistent evaluation protocols. Because all models and benchmarks are inherently compatible in this framework, new models and benchmarks can be added continually with minimal effort. We also developed a Matcher module, powered by a large language model (LLM), to automate ambiguous task alignment between benchmarks and models. Containerization of these modules enhances reproducibility and facilitates the deployment of models and benchmarks across computing platforms. By focusing on predicting underlying biochemical phenomena (e.g., gene expression, open chromatin, DNA binding), we ensure that tasks remain technology-independent. We provide examples of benchmarks and models implementing this framework, and we anticipate that the community will contribute its own, leading to an ever-expanding and evolving set of models and evaluation tasks. This resource will accelerate genomics research by illuminating the best models for a given task, motivating novel functional genomic benchmarks, and providing a more nuanced understanding of model abilities.
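The API paradigm the abstract describes can be pictured as a shared contract that both models and benchmarks implement, so that any model can be paired with any evaluation task without model-specific glue code. The following Python sketch is illustrative only; the class and method names (GenomicsModel, Benchmark, run_benchmark, and so on) are assumptions for exposition, not GAME's actual interface.

    # Minimal sketch of a shared model/benchmark contract, assuming
    # hypothetical names; GAME's real API may differ substantially.
    from abc import ABC, abstractmethod
    from typing import Sequence


    class GenomicsModel(ABC):
        """A pre-trained sequence-to-activity model behind a uniform interface."""

        @abstractmethod
        def supported_tasks(self) -> list[str]:
            """Biochemical phenomena this model predicts, e.g. 'gene_expression'."""

        @abstractmethod
        def predict(self, task: str, sequences: Sequence[str]) -> list[float]:
            """Return one activity score per DNA sequence for the given task."""


    class Benchmark(ABC):
        """A user-defined evaluation task with held-out sequences and labels."""

        @abstractmethod
        def required_task(self) -> str:
            """The biochemical phenomenon this benchmark measures."""

        @abstractmethod
        def evaluate(self, predictions: list[float]) -> dict[str, float]:
            """Score predictions against ground truth, e.g. {'pearson_r': 0.71}."""


    def run_benchmark(model: GenomicsModel, benchmark: Benchmark,
                      sequences: Sequence[str]) -> dict[str, float]:
        # Because both sides implement the shared contract, pairing any
        # model with any benchmark requires no model-specific adapter; a
        # Matcher component could resolve ambiguous task names here.
        task = benchmark.required_task()
        if task not in model.supported_tasks():
            raise ValueError(f"model does not predict {task!r}")
        return benchmark.evaluate(model.predict(task, sequences))

Framing the contract around biochemical phenomena (the task string) rather than assay technologies is what keeps benchmarks technology-independent, as the abstract emphasizes.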