Predicting the Disease Risk of Protein Mutation Sequences With Pre-training Model.

Authors

Li Kuan, Zhong Yue, Lin Xuan, Quan Zhe

Affiliations

School of Cyberspace Security, Dongguan University of Technology, Guangdong, China.

Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen, China.

Publication information

Front Genet. 2020 Dec 21;11:605620. doi: 10.3389/fgene.2020.605620. eCollection 2020.

DOI:10.3389/fgene.2020.605620
PMID:33408741
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7780924/
Abstract

Accurately identifying missense mutations can help alleviate the loss of protein function and structural changes, which might greatly reduce the risk of disease for tumor suppressor genes (e.g., BRCA1 and PTEN). In this paper, we propose a hybrid framework, called BertVS, that predicts the disease risk of missense mutations in proteins. Our framework learns sequence representations from the protein domain through pre-trained BERT models, and also integrates the hydrophilic properties of amino acids to obtain sequence representations of biochemical characteristics. The concatenation of the two learned representations is then sent to a classifier to predict the missense mutations of protein sequences. Specifically, we use the protein family database (Pfam) as a corpus to train the BERT model to learn the contextual information of protein sequences, and our pre-trained BERT model achieves an accuracy of 0.984 on the masked language model prediction task. We conduct extensive experiments on BRCA1 and PTEN datasets. Compared with the baselines, BertVS achieves higher performance, with 0.920 AUROC and 0.915 AUPR in the functionally critical domain of the BRCA1 gene. Additionally, an extended experiment on the ClinVar dataset illustrates that gene variants with known clinical significance can also be efficiently classified by our method. Therefore, BertVS can learn the functional information of protein sequences and effectively predict the disease risk of variants of uncertain clinical significance.
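The feature-fusion step described in the abstract (a learned sequence representation concatenated with a hydropathy-based representation, then scored by AUROC) can be illustrated with a minimal sketch. This is not the authors' implementation: the amino-acid composition vector is only a stand-in for the BERT embedding, the Kyte-Doolittle scale is an assumed choice because the abstract does not name a hydrophilicity scale, and all function names are hypothetical.

```python
from typing import List

# Kyte-Doolittle hydropathy scale. ASSUMPTION: the abstract does not say
# which hydrophilicity scale BertVS uses; this one is chosen for illustration.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KD)


def mutate(seq: str, pos: int, new: str) -> str:
    """Apply a single missense substitution at 0-based position `pos`."""
    return seq[:pos] + new + seq[pos + 1:]


def composition_embedding(seq: str) -> List[float]:
    """Stand-in for a learned BERT embedding: amino-acid frequency vector."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def hydropathy_features(seq: str, pos: int, window: int = 2) -> List[float]:
    """Mean hydropathy of the whole sequence and of a window around `pos`."""
    values = [KD[a] for a in seq]
    lo, hi = max(0, pos - window), min(len(seq), pos + window + 1)
    local = values[lo:hi]
    return [sum(values) / len(values), sum(local) / len(local)]


def variant_features(wild_type: str, pos: int, new: str) -> List[float]:
    """Concatenate both representations into one classifier input vector."""
    mutant = mutate(wild_type, pos, new)
    return composition_embedding(mutant) + hydropathy_features(mutant, pos)


def auroc(scores: List[float], labels: List[int]) -> float:
    """AUROC as the probability that a random positive outscores a random
    negative; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a real pipeline the composition vector would be replaced by the output of the pre-trained protein language model, and the concatenated vector would be passed to a trained classifier whose scores are then evaluated with `auroc`.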


Similar articles

1
Predicting the Disease Risk of Protein Mutation Sequences With Pre-training Model.
Front Genet. 2020 Dec 21;11:605620. doi: 10.3389/fgene.2020.605620. eCollection 2020.
2
Automatic text classification of actionable radiology reports of tinnitus patients using bidirectional encoder representations from transformer (BERT) and in-domain pre-training (IDPT).
BMC Med Inform Decis Mak. 2022 Jul 30;22(1):200. doi: 10.1186/s12911-022-01946-y.
3
ProtPlat: an efficient pre-training platform for protein classification based on FastText.
BMC Bioinformatics. 2022 Feb 11;23(1):66. doi: 10.1186/s12859-022-04604-2.
4
Analysis of missense variation in human BRCA1 in the context of interspecific sequence variation.
J Med Genet. 2004 Jul;41(7):492-507. doi: 10.1136/jmg.2003.015867.
5
Umami-BERT: An interpretable BERT-based model for umami peptides prediction.
Food Res Int. 2023 Oct;172:113142. doi: 10.1016/j.foodres.2023.113142. Epub 2023 Jun 16.
6
Improving language model of human genome for DNA-protein binding prediction based on task-specific pre-training.
Interdiscip Sci. 2023 Mar;15(1):32-43. doi: 10.1007/s12539-022-00537-9. Epub 2022 Sep 22.
7
TRP-BERT: Discrimination of transient receptor potential (TRP) channels using contextual representations from deep bidirectional transformer based on BERT.
Comput Biol Med. 2021 Oct;137:104821. doi: 10.1016/j.compbiomed.2021.104821. Epub 2021 Sep 1.
8
ActTRANS: Functional classification in active transport proteins based on transfer learning and contextual representations.
Comput Biol Chem. 2021 Aug;93:107537. doi: 10.1016/j.compbiolchem.2021.107537. Epub 2021 Jun 29.
9
LBCE-XGB: A XGBoost Model for Predicting Linear B-Cell Epitopes Based on BERT Embeddings.
Interdiscip Sci. 2023 Jun;15(2):293-305. doi: 10.1007/s12539-023-00549-z. Epub 2023 Jan 16.
10
MutTMPredictor: Robust and accurate cascade XGBoost classifier for prediction of mutations in transmembrane proteins.
Comput Struct Biotechnol J. 2021 Nov 19;19:6400-6416. doi: 10.1016/j.csbj.2021.11.024. eCollection 2021.

Cited by

1
Language Modelling Techniques for Analysing the Impact of Human Genetic Variation.
Bioinform Biol Insights. 2025 Sep 2;19:11779322251358314. doi: 10.1177/11779322251358314. eCollection 2025.
2
Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models.
Database (Oxford). 2025 May 30;2025. doi: 10.1093/database/baaf027.
3
Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods.
Pharmaceutics. 2023 Apr 25;15(5):1337. doi: 10.3390/pharmaceutics15051337.
4
BERT-PPII: The Polyproline Type II Helix Structure Prediction Model Based on BERT and Multichannel CNN.
Biomed Res Int. 2022 Aug 24;2022:9015123. doi: 10.1155/2022/9015123. eCollection 2022.