
Deciphering "the language of nature": A transformer-based language model for deleterious mutations in proteins.

Authors

Jiang Theodore T, Fang Li, Wang Kai

Affiliations

Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA.

Palisades Charter High School, Pacific Palisades, CA 90272, USA.

Publication information

Innovation (Camb). 2023 Jul 27;4(5):100487. doi: 10.1016/j.xinn.2023.100487. eCollection 2023 Sep 11.

DOI:10.1016/j.xinn.2023.100487
PMID:37636282
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10448337/
Abstract

Various machine-learning models, including deep neural network models, have already been developed to predict the deleteriousness of missense (non-synonymous) mutations. However, the current state of the art may still be improved by taking a fresh look at the biological problem with more sophisticated self-adaptive machine-learning approaches. Recent advances in the field of natural language processing show that transformer models, a type of deep neural network, are particularly powerful at modeling sequence information with context dependence. In this study, we introduce MutFormer, a transformer-based model for the prediction of deleterious missense mutations, which uses reference and mutated protein sequences from the human genome as the primary features. MutFormer takes advantage of a combination of self-attention layers and convolutional layers to learn both long-range and short-range dependencies between amino acid mutations in a protein sequence. We first pre-trained MutFormer on reference protein sequences and on mutated protein sequences resulting from common genetic variants observed in human populations. We next examined different fine-tuning methods to apply the model to deleteriousness prediction of missense mutations. Finally, we evaluated MutFormer's performance on multiple testing datasets. We found that MutFormer showed similar or improved performance compared with a variety of existing tools, including those that used conventional machine-learning approaches. In conclusion, MutFormer considers sequence features that were not explored in previous studies and can complement existing computational predictions or empirically generated functional scores to improve our understanding of disease variants.
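
The abstract states that the model uses reference and mutated protein sequences as its primary features. As a rough illustration of what such an input pair might look like, the sketch below applies a single missense substitution to a reference sequence and encodes both versions as integer tokens. The mutation notation (`A4V`), the token vocabulary, the example sequence, and the function names `apply_missense`, `encode`, and `paired_input` are all assumptions made for this illustration; none of them is taken from the paper or from the MutFormer code.

```python
from typing import List, Tuple

# Illustrative sketch only; this is not the MutFormer implementation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# Hypothetical vocabulary: 0 is reserved for padding, residues map to 1..20.
TOKEN_IDS = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def apply_missense(reference: str, mutation: str) -> str:
    """Return the mutated sequence for a mutation such as 'A4V' (1-based position)."""
    ref_aa, alt_aa = mutation[0], mutation[-1]
    position = int(mutation[1:-1])
    if not 1 <= position <= len(reference):
        raise ValueError(f"position {position} is outside the sequence")
    if reference[position - 1] != ref_aa:
        raise ValueError(
            f"expected {ref_aa} at position {position}, "
            f"found {reference[position - 1]}"
        )
    if alt_aa not in TOKEN_IDS or alt_aa == ref_aa:
        raise ValueError(f"{mutation} is not a missense substitution")
    # Splice the alternate residue into the reference at the given position.
    return reference[: position - 1] + alt_aa + reference[position:]


def encode(sequence: str) -> List[int]:
    """Map each residue to its integer token id."""
    return [TOKEN_IDS[aa] for aa in sequence]


def paired_input(reference: str, mutation: str) -> Tuple[List[int], List[int]]:
    """Build a (reference, mutated) token pair that a classifier could consume."""
    return encode(reference), encode(apply_missense(reference, mutation))


# Made-up eight-residue sequence, used only to exercise the functions.
ref_tokens, mut_tokens = paired_input("MKTAYIAK", "A4V")
# 1-based positions where the two encodings differ.
changed = [i + 1 for i, (r, m) in enumerate(zip(ref_tokens, mut_tokens)) if r != m]
print(changed)
```

The two token lists differ only at the substituted position, so a model that sees both sequences can, in principle, weigh the change against its surrounding residues. How MutFormer actually combines the two sequences is not described in the abstract.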


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f58/10448337/08d766a98737/fx1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f58/10448337/dc652530eb7c/gr1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f58/10448337/0b71baf8b323/gr2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f58/10448337/58221cf6fd6e/gr3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f58/10448337/1a74fa28f628/gr4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f58/10448337/750d31e342db/gr5.jpg

