Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA 94720, USA.
Computational Biology of Infection Research, Helmholtz Centre for Infection Research, 38124 Brunswick, Germany.
Bioinformatics. 2023 Feb 3;39(2). doi: 10.1093/bioinformatics/btad081.
Gene annotation is the problem of mapping proteins to their functions, represented as Gene Ontology (GO) terms, typically inferred from their primary sequences. It is a multi-label, multi-class classification problem that has generated growing interest for its use in characterizing millions of proteins with unknown functions. However, there is no standard GO dataset within the bioinformatics community for benchmarking newly developed machine learning models. Thus, the significance of improvements in these models remains unclear.
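To make the multi-label framing concrete, the following minimal sketch encodes each protein's GO annotations as a multi-hot vector over a fixed term vocabulary, so that one protein can carry several labels at once. The protein names (`P1` to `P3`), the three-term vocabulary, and the function `build_label_matrix` are illustrative assumptions and are not taken from the GO Benchmarking datasets or code.

```python
from typing import Dict, List, Set


def build_label_matrix(
    annotations: Dict[str, Set[str]], terms: List[str]
) -> List[List[int]]:
    """Return one multi-hot row per protein (proteins in sorted order).

    Column j of a row is 1 if the protein is annotated with terms[j].
    Terms outside the vocabulary are ignored.
    """
    index = {term: j for j, term in enumerate(terms)}
    matrix = []
    for protein in sorted(annotations):
        row = [0] * len(terms)
        for term in annotations[protein]:
            if term in index:
                row[index[term]] = 1
        matrix.append(row)
    return matrix


# Hypothetical toy data: a protein may have several terms, one, or none.
annotations = {
    "P1": {"GO:0003677", "GO:0005634"},
    "P2": {"GO:0005634"},
    "P3": set(),
}
terms = ["GO:0003677", "GO:0005634", "GO:0016301"]

label_matrix = build_label_matrix(annotations, terms)
```

A sequence-based model would then be trained to predict each row of `label_matrix` from the corresponding protein sequence, with an independent yes/no decision per GO term.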
The Gene Benchmarking database is the first effort to provide an easy-to-use, configurable hub for training and evaluating gene annotation models. It provides easy access to pre-specified datasets and, through a web interface, handles the non-trivial steps of preprocessing and filtering all data according to custom presets. The GO bench web application can also be used to evaluate any trained model and display its results on leaderboards for annotation tasks.
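As a rough illustration of what preset-driven filtering of annotation data can look like, the sketch below keeps only annotations with allowed evidence codes and then drops GO terms that occur too rarely to be learned. It is a hedged example under stated assumptions, not the GO bench implementation: the `Annotation` record, the `filter_annotations` function, the two filter parameters, and the toy records are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Set


class Annotation(NamedTuple):
    protein: str   # hypothetical protein identifier
    term: str      # GO term identifier
    evidence: str  # GO evidence code, e.g. "EXP" or "IEA"


def filter_annotations(
    records: Iterable[Annotation],
    allowed_evidence: Set[str],
    min_term_count: int,
) -> List[Annotation]:
    """Apply an evidence-code filter, then a minimum term frequency filter."""
    # Step 1: keep only annotations backed by an allowed evidence code.
    kept = [r for r in records if r.evidence in allowed_evidence]
    # Step 2: count remaining annotations per term and drop rare terms.
    counts = Counter(r.term for r in kept)
    return [r for r in kept if counts[r.term] >= min_term_count]


# Hypothetical toy records and preset values.
records = [
    Annotation("P1", "GO:0003677", "EXP"),
    Annotation("P2", "GO:0003677", "IDA"),
    Annotation("P3", "GO:0005634", "IEA"),
    Annotation("P4", "GO:0016301", "EXP"),
]
filtered = filter_annotations(
    records, allowed_evidence={"EXP", "IDA"}, min_term_count=2
)
```

With these assumed settings, the electronically inferred annotation (`IEA`) is removed by the evidence filter, and the term that appears only once is removed by the frequency filter.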
The GO Benchmarking dataset is freely available at www.gobench.org. Code is hosted at github.com/mofradlab, with repositories for website code, core utilities and examples of usage (Supplementary Section S.7).
Supplementary data are available at Bioinformatics online.