MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK.
Sci Rep. 2024 Oct 30;14(1):26114. doi: 10.1038/s41598-024-76202-6.
Variant effect predictors (VEPs) are computational tools developed to assess the impacts of genetic mutations, often in terms of likely pathogenicity, employing diverse algorithms and training data. Here, we investigate the performance of 35 VEPs in discriminating between pathogenic and putatively benign missense variants across 963 human protein-coding genes. We observe considerable gene-level heterogeneity as measured by the widely used area under the receiver operating characteristic curve (AUROC) metric. To investigate the origins of this heterogeneity and the extent to which gene-level VEP performance is predictable, we train random forest models for each VEP to predict its gene-level AUROC. We find that performance as measured by AUROC is related to factors such as gene function, protein structure, and evolutionary conservation. Notably, intrinsic disorder in proteins emerges as a significant factor influencing apparent VEP performance, often leading to inflated AUROC values because disordered regions are enriched in weakly conserved putatively benign variants. Our results suggest that gene-level features may be useful for identifying genes where VEP predictions are likely to be more or less reliable. However, our work also shows that AUROC, despite being independent of class balance, still has crucial limitations when used for comparing VEP performance across different genes.
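To make the gene-level AUROC concrete, the following is a minimal sketch rather than the study's pipeline. It assumes that a higher VEP score means "more likely pathogenic", and it computes AUROC as the probability that a randomly chosen pathogenic variant outscores a randomly chosen putatively benign one, with ties counted as one half. The gene names, scores, and the function names `auroc` and `gene_level_auroc` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def auroc(pathogenic_scores, benign_scores):
    """AUROC as the fraction of (pathogenic, benign) pairs ranked correctly.

    Assumes higher scores indicate greater predicted pathogenicity.
    Tied pairs contribute 0.5.
    """
    if not pathogenic_scores or not benign_scores:
        raise ValueError("AUROC needs at least one variant in each class")
    correct = 0.0
    for p in pathogenic_scores:
        for b in benign_scores:
            if p > b:
                correct += 1.0
            elif p == b:
                correct += 0.5
    return correct / (len(pathogenic_scores) * len(benign_scores))


def gene_level_auroc(variants):
    """Group (gene, label, score) records by gene and compute one AUROC per gene.

    Genes lacking either class are skipped, since AUROC is undefined for them.
    """
    by_gene = defaultdict(lambda: {"pathogenic": [], "benign": []})
    for gene, label, score in variants:
        by_gene[gene][label].append(score)
    return {
        gene: auroc(groups["pathogenic"], groups["benign"])
        for gene, groups in by_gene.items()
        if groups["pathogenic"] and groups["benign"]
    }


# Toy records with invented values (not data from the study).
variants = [
    ("GENE_A", "pathogenic", 0.9),
    ("GENE_A", "pathogenic", 0.8),
    ("GENE_A", "benign", 0.1),
    ("GENE_A", "benign", 0.2),
    ("GENE_B", "pathogenic", 0.9),
    ("GENE_B", "pathogenic", 0.3),
    ("GENE_B", "benign", 0.5),
    ("GENE_B", "benign", 0.1),
]

results = gene_level_auroc(variants)
print(results)
```

In this toy data the same predictor separates the classes perfectly in one gene and only partly in the other. Because the pairwise definition depends only on how scores rank across the two classes, a gene whose benign variants are unusually easy to score low (for example, weakly conserved positions) can yield a high AUROC without the predictor being better at recognising pathogenic variants.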