SHINE: protein language model-based pathogenicity prediction for short inframe insertion and deletion variants.

Affiliations

Department of Pediatrics, Columbia University, New York, NY, USA.

Department of Systems Biology, Columbia University, New York, NY, USA.

Publication information

Brief Bioinform. 2023 Jan 19;24(1). doi: 10.1093/bib/bbac584.

Abstract

Accurate variant pathogenicity predictions are important in genetic studies of human diseases. Inframe insertion and deletion variants (indels) alter protein sequence and length, but are not as deleterious as frameshift indels. Interpreting inframe indels is challenging because few known pathogenic variants are available for training. Existing prediction methods largely use manually encoded features, including conservation, protein structure and function, and allele frequency, to infer variant pathogenicity. Recent advances in deep learning modeling of protein sequences and structures provide an opportunity to improve the representation of salient features based on large numbers of protein sequences. We developed SHINE, a new pathogenicity predictor for SHort Inframe iNsertion and dEletion variants. SHINE uses pretrained protein language models to construct a latent representation of an indel and its protein context from protein sequences and multiple protein sequence alignments, and feeds the latent representation into supervised machine learning models for pathogenicity prediction. We curated training data from ClinVar and gnomAD, and created two test datasets from different sources. SHINE achieved better prediction performance than existing methods for both deletion and insertion variants in these two test datasets. Our work suggests that unsupervised protein language models can provide valuable information about proteins, and that new methods based on these models can improve variant interpretation in genetic analyses.
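The two-stage idea described in the abstract (embed the indel and its protein context, then score the embedding with a supervised model) can be illustrated with a minimal sketch. This is not the SHINE implementation: `toy_embed` is a hypothetical stand-in for a pretrained protein language model, and the window size, mean pooling, example sequence, and logistic weights are all assumptions chosen only to make the example runnable.

```python
import hashlib
import math

DIM = 8  # toy embedding width (assumption); real language models use far more dimensions


def toy_embed(sequence):
    """Hypothetical stand-in for a pretrained protein language model.

    Returns one DIM-length vector per residue. Each vector is derived by
    hashing the residue together with its two neighbours, so it depends on
    local context. A real model would be loaded and called here instead.
    """
    padded = "-" + sequence + "-"
    vectors = []
    for i in range(1, len(padded) - 1):
        digest = hashlib.sha256(padded[i - 1:i + 2].encode()).digest()
        vectors.append([b / 256 for b in digest[:DIM]])
    return vectors


def indel_features(sequence, start, end, flank=3):
    """Mean-pool residue vectors over the affected span [start, end) plus flanks."""
    vectors = toy_embed(sequence)
    lo = max(0, start - flank)
    hi = min(len(sequence), end + flank)
    window = vectors[lo:hi]
    return [sum(column) / len(window) for column in zip(*window)]


def pathogenicity_score(features, weights, bias=0.0):
    """Logistic score in (0, 1); in practice the weights come from supervised training."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Made-up example sequence and a deletion of residues 5-6 (0-based, end-exclusive).
protein = "MKTAYIAKQRQISFVKSHFSRQ"
features = indel_features(protein, start=5, end=7)
score = pathogenicity_score(features, weights=[0.1] * DIM)
```

In a real pipeline the placeholder weights would be replaced by a classifier fitted on labeled variants, such as the ClinVar and gnomAD training data mentioned in the abstract.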
