Predicting the Disease Risk of Protein Mutation Sequences With Pre-training Model.

Author information

Li Kuan, Zhong Yue, Lin Xuan, Quan Zhe

Affiliations

School of Cyberspace Security, Dongguan University of Technology, Guangdong, China.

Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen, China.

Publication information

Front Genet. 2020 Dec 21;11:605620. doi: 10.3389/fgene.2020.605620. eCollection 2020.

Abstract

Accurately identifying missense mutations can help alleviate the loss of protein function and the structural changes they cause, which might greatly reduce the risk of disease for tumor suppressor genes (e.g., BRCA1 and PTEN). In this paper, we propose a hybrid framework, called BertVS, that predicts the disease risk of missense mutations in proteins. Our framework learns sequence representations in the protein domain by pre-training BERT models, and also integrates the hydrophilic properties of amino acids to obtain sequence representations of biochemical characteristics. The concatenation of the two learned representations is then sent to the classifier to predict the missense mutations of protein sequences. Specifically, we use the protein family database (Pfam) as a corpus to train the BERT model to learn the contextual information of protein sequences, and our pre-trained BERT model achieves an accuracy of 0.984 on the masked language model prediction task. We conduct extensive experiments on BRCA1 and PTEN datasets. Compared with the baselines, BertVS achieves higher performance, with an AUROC of 0.920 and an AUPR of 0.915, in the functionally critical domain of the BRCA1 gene. Additionally, an extended experiment on the ClinVar dataset illustrates that gene variants with known clinical significance can also be efficiently classified by our method. Therefore, BertVS can learn the functional information of protein sequences and effectively predict the disease risk of variants of uncertain clinical significance.
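
The feature construction described in the abstract (a learned sequence representation concatenated with a hydrophilicity-based representation) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the Kyte-Doolittle hydropathy index as the hydrophilicity scale, which the abstract does not specify, and it replaces the pre-trained BERT embedding with a toy stand-in. The names `apply_missense`, `hydropathy_features`, `placeholder_embedding`, and `combined_representation` are hypothetical.

```python
from typing import List

# Kyte-Doolittle hydropathy index per amino acid.
# Assumption: the abstract does not say which hydrophilicity scale is used.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def apply_missense(seq: str, position: int, new_residue: str) -> str:
    """Return a copy of seq with the residue at a 1-based position replaced."""
    if not 1 <= position <= len(seq):
        raise ValueError("position is outside the sequence")
    if new_residue not in KYTE_DOOLITTLE:
        raise ValueError("unknown amino acid code")
    return seq[:position - 1] + new_residue + seq[position:]


def hydropathy_features(seq: str) -> List[float]:
    """Map each residue to its hydropathy value (biochemical representation)."""
    return [KYTE_DOOLITTLE[aa] for aa in seq]


def placeholder_embedding(seq: str, dim: int = 4) -> List[float]:
    """Toy stand-in for a pre-trained BERT sequence embedding (hypothetical).

    It buckets residues by character code and returns bucket frequencies,
    so it carries no learned information; a real model would go here.
    """
    counts = [0] * dim
    for aa in seq:
        counts[ord(aa) % dim] += 1
    return [c / len(seq) for c in counts]


def combined_representation(seq: str, dim: int = 4) -> List[float]:
    """Concatenate the two representations into one classifier input vector."""
    return placeholder_embedding(seq, dim) + hydropathy_features(seq)


if __name__ == "__main__":
    wild_type = "MKTAYIAK"                       # toy sequence, not a real protein
    mutant = apply_missense(wild_type, 3, "P")   # substitute residue 3 with proline
    print(mutant)
    print(combined_representation(mutant))
```

In a real pipeline the combined vector would be passed to a trained classifier, and the resulting scores would be evaluated with metrics such as AUROC and AUPR, as reported in the abstract.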
