GUANinE v1.0: Benchmark Datasets for Genomic AI Sequence-to-Function Models.

Authors

Robson Eyes S, Ioannidis Nilah M

Affiliations

Center for Computational Biology, UC Berkeley, Berkeley, CA 94720.

Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, CA 94720.

Publication information

bioRxiv. 2024 Mar 7:2023.10.12.562113. doi: 10.1101/2023.10.12.562113.

DOI:10.1101/2023.10.12.562113
PMID:37904945
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10614795/
Abstract

Computational genomics increasingly relies on machine learning methods for genome interpretation, and the recent adoption of neural sequence-to-function models highlights the need for rigorous model specification and controlled evaluation, problems familiar to other fields of AI. Research strategies that have greatly benefited other fields - including benchmarking, auditing, and algorithmic fairness - are also needed to advance the field of genomic AI and to facilitate model development. Here we propose a genomic AI benchmark, GUANinE, for evaluating model generalization across a number of distinct genomic tasks. Compared to existing task formulations in computational genomics, GUANinE is large-scale, de-noised, and suitable for evaluating pretrained models. GUANinE v1.0 primarily focuses on functional genomics tasks such as functional element annotation and gene expression prediction, and it also draws upon connections to evolutionary biology through sequence conservation tasks. The current GUANinE tasks provide insight into the performance of existing genomic AI models and non-neural baselines, with opportunities to be refined, revisited, and broadened as the field matures. Finally, the GUANinE benchmark allows us to evaluate new self-supervised T5 models and explore the tradeoffs between tokenization and model performance, while showcasing the potential for self-supervision to complement existing pretraining procedures.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4831/10926752/6100a4c80926/nihpp-2023.10.12.562113v3-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4831/10926752/da9b7599165f/nihpp-2023.10.12.562113v3-f0002.jpg
