Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution.

Affiliations

MGI, BGI-Shenzhen, Shenzhen 518083, China.

Department of Biology, University of Copenhagen, Copenhagen DK-2200, Denmark.

Publication information

Nucleic Acids Res. 2022 Aug 12;50(14):e81. doi: 10.1093/nar/gkac326.

Abstract

Interpreting the non-coding genome remains an unsolved challenge in human genetics because exhaustively annotating biochemically active elements under all conditions is impractical. Deep-learning-based computational approaches have recently emerged to help interpret non-coding regions. Here, we present LOGO (Language of Genome), a self-attention-based, contextualized, pre-trained language model. It contains only two self-attention layers with 1 million parameters, making it a substantially lightweight architecture, and it applies self-supervision techniques to learn bidirectional representations of the unlabelled human reference genome. LOGO is then fine-tuned for sequence labelling tasks and further extended to variant prioritization via a special input encoding scheme for alternative alleles, followed by the addition of a convolutional module. Experiments show that LOGO achieves a 15% absolute improvement for promoter identification and up to a 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% of the parameters of the fully supervised model DeepSEA and 1% of the parameters of a recent BERT-based DNA language model. For allelic-effect prediction, the locality introduced by one-dimensional convolution improves sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer the underlying regulatory mechanisms. We draw a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework for interpreting non-coding regions, both for global sequence labelling and for variant prioritization at base resolution.
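
The contrast the abstract draws between global self-attention and the locality of a one-dimensional convolution can be illustrated with a toy sketch. This is not the LOGO implementation: the one-hot input, the single attention head with identity projections, the fixed smoothing kernel, the attention-then-convolution order, and the `allelic_effect` score are all assumptions made for illustration, and nothing here is trained.

```python
import math

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as one 4-dimensional one-hot vector per base."""
    return [[1.0 if base == a else 0.0 for a in ALPHABET] for base in seq]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the inputs themselves
    (identity projections), so every position attends to all others.
    """
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


def conv1d(x, kernel):
    """Zero-padded, per-channel 1D convolution (cross-correlation form).

    Each output position only sees a window of len(kernel) neighbours,
    which is the locality that attention alone does not impose.
    """
    half = len(kernel) // 2
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        row = []
        for j in range(d):
            acc = 0.0
            for k, w in enumerate(kernel):
                pos = i + k - half
                if 0 <= pos < n:
                    acc += w * x[pos][j]
            row.append(acc)
        out.append(row)
    return out


def allelic_effect(ref, alt, kernel=(0.25, 0.5, 0.25)):
    """Hypothetical score: per-position L1 distance between ref and alt features."""
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles substitutions (equal lengths)")
    f_ref = conv1d(self_attention(one_hot(ref)), kernel)
    f_alt = conv1d(self_attention(one_hot(alt)), kernel)
    return [sum(abs(a - b) for a, b in zip(ra, rb)) for ra, rb in zip(f_ref, f_alt)]


if __name__ == "__main__":
    # Reference vs. alternative allele differing at index 3 (T -> A).
    print([round(e, 3) for e in allelic_effect("ACGTACGT", "ACGAACGT")])
```

In this toy setup a single substitution perturbs every position slightly through attention, while the convolution concentrates the signal around the variant site. A real model would learn the projections and kernels instead of fixing them.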

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/948f/9371931/2236102876ed/gkac326fig1.jpg
