Department of Pediatrics, Columbia University, New York, NY, USA.
Department of Systems Biology, Columbia University, New York, NY, USA.
Brief Bioinform. 2023 Jan 19;24(1). doi: 10.1093/bib/bbac584.
Accurate variant pathogenicity predictions are important in genetic studies of human diseases. Inframe insertion and deletion variants (indels) alter protein sequence and length, but are not as deleterious as frameshift indels. Interpreting inframe indels is challenging because few known pathogenic variants are available for training. Existing prediction methods largely rely on manually encoded features, including conservation, protein structure and function, and allele frequency, to infer variant pathogenicity. Recent advances in deep learning models of protein sequences and structures provide an opportunity to improve the representation of salient features by learning from large numbers of protein sequences. We developed SHINE, a new pathogenicity predictor for SHort Inframe iNsertion and dEletion variants. SHINE uses pretrained protein language models to construct a latent representation of an indel and its protein context from protein sequences and protein multiple sequence alignments, and feeds this latent representation into supervised machine learning models for pathogenicity prediction. We curated training data from ClinVar and gnomAD, and created two test datasets from different sources. SHINE achieved better prediction performance than existing methods for both deletion and insertion variants in these two test datasets. Our work suggests that unsupervised protein language models can provide valuable information about proteins, and that new methods based on these models can improve variant interpretation in genetic analyses.
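The two-stage design described in the abstract (a fixed representation of the indel and its protein context, followed by a supervised classifier) can be illustrated with a minimal sketch. This is not the SHINE implementation: `toy_embed` is a hypothetical stand-in that uses amino-acid composition in place of a pretrained protein language model, `train_logistic` is a plain logistic regression chosen only for brevity, and the two training examples are synthetic placeholders rather than real variants. The `is_inframe` helper encodes the standard definition that an indel is inframe when its net length change is a nonzero multiple of three nucleotides.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def is_inframe(ref_allele, alt_allele):
    """True when the net nucleotide length change is a nonzero multiple of 3."""
    delta = len(alt_allele) - len(ref_allele)
    return delta != 0 and delta % 3 == 0

def toy_embed(protein, start, end, flank=5):
    """Hypothetical stand-in for a language-model embedding (assumption):
    amino-acid composition of residues [start, end) plus `flank` residues
    of context on each side."""
    window = protein[max(0, start - flank):end + flank]
    return [window.count(aa) / max(1, len(window)) for aa in AMINO_ACIDS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=200):
    """Fit logistic regression by stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            grad = p - yi  # derivative of the log loss w.r.t. the logit
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b

def predict_proba(w, b, x):
    """Predicted probability of the positive (pathogenic) class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Synthetic placeholder data (NOT real variants): label 1 = pathogenic.
X_train = [
    toy_embed("CCCCCCCCCC", 3, 5),  # deletion inside a cysteine-rich context
    toy_embed("AAAAAAAAAA", 3, 5),  # deletion inside an alanine-rich context
]
y_train = [1, 0]
w, b = train_logistic(X_train, y_train)
```

In a realistic pipeline, the embedding step would be replaced by representations from a pretrained protein language model, and the labels would come from curated sources such as the ClinVar and gnomAD data mentioned above; the classifier interface would stay the same.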