Joshi Dinesh, Pradhan Swatantra, Sajeed Rakshanda, Srinivasan Rajgopal, Rana Sadhna
TCS Research, Tata Consultancy Services, Hyderabad, India.
Hum Genet. 2025 Mar;144(2-3):143-158. doi: 10.1007/s00439-025-02727-z. Epub 2025 Jan 27.
Variants of uncertain significance (VUS) are variants that lack sufficient evidence to be confidently associated with a disease, and they therefore pose a challenge in the interpretation of genetic testing results. Here we report an improved method for predicting the effect of VUS in the arylsulfatase A (ARSA) gene, developed as part of the Critical Assessment of Genome Interpretation challenge (CAGI6). Our method uses a transfer learning approach that leverages a pre-trained protein language model to predict the impact of mutations on the activity of the ARSA enzyme, whose deficiency is known to cause a rare genetic disorder, metachromatic leukodystrophy. Our framework combines zero-shot log odds scores and embeddings from ESM, an evolutionary scale model, as features for training a supervised model on variants of genes functionally related to ARSA. The zero-shot log odds score captures generic protein properties that the model learned during pre-training on millions of sequences in the UniProt data, while the ESM embeddings for the proteins in the ARSA family capture features specific to that family. We also tested our approach on another enzyme, N-acetyl-glucosaminidase (NAGLU), which belongs to the same superfamily as ARSA. Our results demonstrate that the performance of our family models (augmented ESM models) is comparable to or better than that of the ESM models. The ARSA model compares favorably with the majority of state-of-the-art predictors on the area under the precision-recall curve (AUPRC) performance metric, and the NAGLU model outperforms all pathogenicity predictors evaluated in this study on the same metric. The improved AUPRC is relevant in a diagnostic setting, where variant prioritization generally entails identifying a small number of pathogenic variants among a larger number of benign variants.
Our results also indicate that, for genes that have sparse or no experimental variant impact data, family variant data can serve as proxy training data for making accurate predictions. Attention analysis of active sites and binding sites in the ARSA and NAGLU proteins sheds light on probable mechanisms of pathogenicity at highly attended positions.
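The two quantities central to the abstract, the zero-shot log odds score and the AUPRC, can be illustrated with a minimal sketch. The abstract does not say how either was computed, so the sketch makes two assumptions: the log odds score is taken to be the log-probability of the mutant residue minus that of the wild-type residue at the mutated position, and AUPRC is estimated as average precision. The function names, the 20-letter alphabet ordering, and the `probs` table (standing in for a language model's per-position output) are hypothetical and are not taken from the paper or from the ESM library.

```python
import math

# Assumption: a fixed ordering of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def zero_shot_log_odds(probs, position, wild_type, mutant):
    """Return log p(mutant) - log p(wild type) at a 0-based position.

    probs is a hypothetical table with one row per sequence position and
    one probability per amino acid, such as a masked language model might
    produce. Negative values mean the model finds the mutant residue less
    plausible than the wild-type residue.
    """
    row = probs[position]
    return math.log(row[AA_INDEX[mutant]]) - math.log(row[AA_INDEX[wild_type]])


def average_precision(labels, scores):
    """Estimate AUPRC as average precision.

    labels: 1 for pathogenic (positive class), 0 for benign.
    scores: higher means predicted more likely pathogenic.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    positives_seen = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            positives_seen += 1
            # Precision at the rank of each recovered positive.
            precision_sum += positives_seen / rank
    if positives_seen == 0:
        raise ValueError("at least one positive label is required")
    return precision_sum / positives_seen
```

Under these assumptions, a strongly negative log odds score suggests a damaging substitution, so the negated score could be passed to `average_precision` as the pathogenicity score. Average precision rewards ranking the few pathogenic variants ahead of the many benign ones, which matches the variant prioritization setting described above.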