Deciphering "the language of nature": A transformer-based language model for deleterious mutations in proteins.

Author information

Jiang Theodore T, Fang Li, Wang Kai

Affiliations

Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA.

Palisades Charter High School, Pacific Palisades, CA 90272, USA.

Publication information

Innovation (Camb). 2023 Jul 27;4(5):100487. doi: 10.1016/j.xinn.2023.100487. eCollection 2023 Sep 11.

Abstract

Various machine-learning models, including deep neural network models, have already been developed to predict the deleteriousness of missense (non-synonymous) mutations. The current state of the art, however, may still benefit from a fresh look at the biological problem using more sophisticated self-adaptive machine-learning approaches. Recent advances in the field of natural language processing show that transformer models, a type of deep neural network, are particularly powerful at modeling sequence information with context dependence. In this study, we introduce MutFormer, a transformer-based model for the prediction of deleterious missense mutations, which uses reference and mutated protein sequences from the human genome as its primary features. MutFormer takes advantage of a combination of self-attention layers and convolutional layers to learn both long-range and short-range dependencies between amino acid mutations in a protein sequence. We first pre-trained MutFormer on reference protein sequences and on mutated protein sequences resulting from common genetic variants observed in human populations. We next examined different fine-tuning methods to apply the model to deleteriousness prediction of missense mutations. Finally, we evaluated MutFormer's performance on multiple testing datasets. We found that MutFormer showed similar or improved performance relative to a variety of existing tools, including those that use conventional machine-learning approaches. In conclusion, MutFormer considers sequence features that were not explored in previous studies and can complement existing computational predictions or empirically generated functional scores to improve our understanding of disease variants.
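The architecture summarized above (self-attention for long-range dependencies plus convolution for short-range dependencies) can be illustrated with a short sketch. The block below is a minimal, hypothetical reconstruction in PyTorch, not the authors' released code; the class name ConvAttentionBlock and all layer sizes are assumptions chosen for illustration.

# A minimal sketch (not the MutFormer release) of an encoder block that
# combines self-attention with a depthwise convolution over the sequence.
# All dimensions below are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn

class ConvAttentionBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, kernel_size=7):
        super().__init__()
        # Self-attention captures long-range dependencies across the sequence.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # A depthwise convolution captures short-range (local) context.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # x: (batch, sequence_length, d_model) embedded amino acid tokens.
        attn_out, _ = self.attn(x, x, x)
        # Conv1d expects (batch, channels, length), so transpose around it.
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        # Combine the long-range and short-range branches residually.
        x = self.norm1(x + attn_out + conv_out)
        return self.norm2(x + self.ffn(x))

# Usage example: encode 2 protein sequences of length 128, each position
# already embedded as a 768-dimensional amino acid vector.
tokens = torch.randn(2, 128, 768)
block = ConvAttentionBlock()
print(block(tokens).shape)  # torch.Size([2, 128, 768])

In a deleteriousness classifier of this style, the reference and mutated sequences would each pass through a stack of such blocks, with a classification head fine-tuned on labeled benign and pathogenic variants; the exact stacking and head design here are assumptions.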

Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f58/10448337/08d766a98737/fx1.jpg
