Cross-protein transfer learning substantially improves disease variant prediction.

Affiliations

Computer Science Division, University of California, Berkeley, 94720, CA, USA.

Department of Statistics, University of California, Berkeley, 94720, CA, USA.

Publication information

Genome Biol. 2023 Aug 7;24(1):182. doi: 10.1186/s13059-023-03024-6.

DOI: 10.1186/s13059-023-03024-6

PMID: 37550700

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10408151/

Abstract

BACKGROUND

Genetic variation in the human genome is a major determinant of individual disease risk, but the vast majority of missense variants have unknown etiological effects. Here, we present a robust learning framework for leveraging saturation mutagenesis experiments to construct accurate computational predictors of proteome-wide missense variant pathogenicity.

RESULTS

We train cross-protein transfer (CPT) models using deep mutational scanning (DMS) data from only five proteins and achieve state-of-the-art performance on clinical variant interpretation for unseen proteins across the human proteome. We also improve predictive accuracy on DMS data from held-out proteins. High sensitivity is crucial for clinical applications and our model CPT-1 particularly excels in this regime. For instance, at 95% sensitivity of detecting human disease variants annotated in ClinVar, CPT-1 improves specificity to 68%, from 27% for ESM-1v and 55% for EVE. Furthermore, for genes not used to train REVEL, a supervised method widely used by clinicians, we show that CPT-1 compares favorably with REVEL. Our framework combines predictive features derived from general protein sequence models, vertebrate sequence alignments, and AlphaFold structures, and it is adaptable to the future inclusion of other sources of information. We find that vertebrate alignments, albeit rather shallow with only 100 genomes, provide a strong signal for variant pathogenicity prediction that is complementary to recent deep learning-based models trained on massive amounts of protein sequence data. We release predictions for all possible missense variants in 90% of human genes.
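The "specificity at 95% sensitivity" figures quoted above can be illustrated with a minimal sketch. It assumes that a higher score means "more likely pathogenic" and that labels are 1 for pathogenic and 0 for benign; the function name and the toy numbers are hypothetical and are not taken from the paper or from ClinVar.

```python
import math

def specificity_at_sensitivity(scores, labels, target_sensitivity=0.95):
    """Return (specificity, threshold) at the strictest score threshold
    whose sensitivity is at least target_sensitivity.

    Assumed convention: higher score = more likely pathogenic;
    label 1 = pathogenic, label 0 = benign.
    """
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one pathogenic and one benign variant")
    # Number of pathogenic variants that must be flagged to reach the target.
    k = max(1, math.ceil(target_sensitivity * len(pos)))
    threshold = pos[k - 1]  # a variant is called pathogenic when score >= threshold
    true_negatives = sum(1 for s in neg if s < threshold)
    return true_negatives / len(neg), threshold

# Toy scores only, for illustration; not data from the study.
toy_scores = [0.9, 0.8, 0.4, 0.2, 0.1, 0.3, 0.5, 0.7]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
spec, thr = specificity_at_sensitivity(toy_scores, toy_labels, target_sensitivity=0.75)
print(f"specificity={spec:.2f} at threshold={thr}")
```

Fixing sensitivity first and then reading off specificity mirrors the clinical setting described above, where missing a disease variant is the costlier error.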

CONCLUSIONS

Our results demonstrate the utility of mutational scanning data for learning properties of variants that transfer to unseen proteins.
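Evaluating transfer to unseen proteins requires that no protein contributes variants to both the training and the evaluation set. The sketch below shows one simple way to enforce that split. The record layout, the protein names, and the scores are hypothetical placeholders, and this is not the authors' pipeline.

```python
def split_by_protein(records, train_proteins):
    """Partition variant records so that no protein appears on both sides."""
    train_set = set(train_proteins)
    train = [r for r in records if r["protein"] in train_set]
    held_out = [r for r in records if r["protein"] not in train_set]
    return train, held_out

# Toy records; protein names and scores are placeholders, not data from the paper.
toy_records = [
    {"protein": "PROT_A", "variant": "A10V", "dms_score": 0.2},
    {"protein": "PROT_A", "variant": "G25D", "dms_score": 0.9},
    {"protein": "PROT_B", "variant": "L7P", "dms_score": 0.8},
    {"protein": "PROT_C", "variant": "R3H", "dms_score": 0.1},
]
train, held_out = split_by_protein(toy_records, ["PROT_A", "PROT_B"])

# Sanity check: the two sides share no protein, so evaluation is on unseen proteins.
train_names = {r["protein"] for r in train}
held_out_names = {r["protein"] for r in held_out}
assert train_names.isdisjoint(held_out_names)
```

Splitting by protein rather than by individual variant avoids leakage from neighboring positions in the same protein, which would otherwise inflate the apparent accuracy on "held-out" data.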

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/15ac/10408151/f4a1aa1c5694/13059_2023_3024_Fig1_HTML.jpg
