A DNA language model based on multispecies alignment predicts the effects of genome-wide variants.

Authors

Benegas Gonzalo, Albors Carlos, Aw Alan J, Ye Chengzhong, Song Yun S

Affiliations

Graduate Group in Computational Biology, University of California, Berkeley, CA, US.

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, US.

Publication information

Nat Biotechnol. 2025 Jan 2. doi: 10.1038/s41587-024-02511-w.

Abstract

Protein language models have demonstrated remarkable performance in predicting the effects of missense variants but DNA language models have not yet shown a competitive edge for complex genomes such as that of humans. This limitation is particularly evident when dealing with the vast complexity of noncoding regions that comprise approximately 98% of the human genome. To tackle this challenge, we introduce GPN-MSA (genomic pretrained network with multiple-sequence alignment), a framework that leverages whole-genome alignments across multiple species while taking only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC and OMIM), experimental functional assays (deep mutational scanning and DepMap) and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and noncoding variants. We provide precomputed scores for all ~9 billion possible single-nucleotide variants in the human genome. We anticipate that our advances in genome-wide variant effect prediction will enable more accurate rare disease diagnosis and improve rare variant burden testing.
