Assessing the performance of 28 pathogenicity prediction methods on rare single nucleotide variants in coding regions.

Author information

Heo Jee Yeon, Kim Ju Han

Affiliations

Division of Biomedical Informatics, Seoul National University Biomedical Informatics (SNUBI), Seoul National University College of Medicine, Seoul, Korea.

Department of Neuropsychiatry, Seoul National University Hospital, Seoul, 03080, Korea.

Publication information

BMC Genomics. 2025 Jul 7;26(1):641. doi: 10.1186/s12864-025-11787-4.

DOI:10.1186/s12864-025-11787-4

PMID:40624478

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12235850/

Abstract

BACKGROUND

Accurate pathogenicity prediction of rare variants in coding regions is crucial for prioritizing candidate variants in human diseases and advancing personalized precision medicine. Although many prediction methods have been developed, it remains unclear how they perform specifically on rare variants.

RESULTS

In this study, the performance of 28 pathogenicity prediction methods was assessed using the latest ClinVar dataset, with a focus on rare variants and on different allele frequency (AF) ranges. Ten evaluation metrics were used to assess the predictive performance of each method. The methods were selected based on their training approaches, including whether the training dataset was filtered by AF and whether AF was incorporated as a feature. Most methods focused on missense and start-lost variants, covering only a subset of nonsynonymous SNVs. On average, prediction scores were unavailable for approximately 10% of these variants (the missing rate). MetaRNN and ClinPred, which incorporated conservation, other prediction scores, and AFs as features, demonstrated the highest predictive power on rare variants. For most methods, specificity was lower than sensitivity. Across AF ranges, most performance metrics tended to decline as AF decreased, with specificity showing a particularly large decline.
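To make the quantities in this paragraph concrete, the following is a minimal Python sketch of how sensitivity, specificity, and missing rate could be computed per AF range for a single predictor. It is an illustration only and not the authors' pipeline: the variant records, the AF bins, the 0.5 score cutoff, and the function names `evaluate` and `by_af_bin` are all assumptions made for the example.

```python
# Hypothetical records: (allele_frequency, is_pathogenic, score).
# A score of None marks a variant the predictor could not score.
VARIANTS = [
    (0.00001, True, 0.95),
    (0.00002, True, 0.40),
    (0.00003, False, 0.70),
    (0.00005, False, 0.10),
    (0.0005, True, 0.90),
    (0.0008, False, 0.20),
    (0.0009, False, None),
    (0.005, False, 0.05),
]

# Illustrative AF bins (lower bound inclusive, upper bound exclusive).
# These are assumptions, not the ranges used in the study.
AF_BINS = [(0.0, 0.0001), (0.0001, 0.001), (0.001, 0.01)]

THRESHOLD = 0.5  # assumed cutoff: score >= 0.5 is called pathogenic


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def evaluate(variants, threshold=THRESHOLD):
    """Compute sensitivity, specificity, and missing rate for one variant set."""
    tp = fn = tn = fp = missing = 0
    for _af, is_pathogenic, score in variants:
        if score is None:
            missing += 1  # unscored variants count only toward the missing rate
            continue
        predicted_pathogenic = score >= threshold
        if is_pathogenic:
            if predicted_pathogenic:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_pathogenic:
                fp += 1
            else:
                tn += 1
    return {
        "sensitivity": ratio(tp, tp + fn),   # TP / (TP + FN)
        "specificity": ratio(tn, tn + fp),   # TN / (TN + FP)
        "missing_rate": ratio(missing, len(variants)),
    }


def by_af_bin(variants, bins=AF_BINS):
    """Evaluate the predictor separately within each AF bin."""
    results = {}
    for low, high in bins:
        subset = [v for v in variants if low <= v[0] < high]
        results[(low, high)] = evaluate(subset)
    return results


if __name__ == "__main__":
    for af_range, metrics in by_af_bin(VARIANTS).items():
        print(af_range, metrics)
```

In this sketch, unscored variants are excluded from sensitivity and specificity and counted only in the missing rate. That is one possible convention; the abstract does not state which one the study used.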

CONCLUSIONS

These results provide insights into the strengths and limitations of each method in predicting the pathogenicity of rare variants, which may guide future improvements in predictive models. Furthermore, while AF and existing prediction scores offer valuable information for prediction methods, the identification of novel biological features is essential to overcome current limitations and further improve predictive performance.

SUPPLEMENTARY INFORMATION

The online version contains supplementary material available at 10.1186/s12864-025-11787-4.
