Assessing the performance of 28 pathogenicity prediction methods on rare single nucleotide variants in coding regions.

Authors

Heo Jee Yeon, Kim Ju Han

Affiliations

Division of Biomedical Informatics, Seoul National University Biomedical Informatics (SNUBI), Seoul National University College of Medicine, Seoul, Korea.

Department of Neuropsychiatry, Seoul National University Hospital, Seoul, 03080, Korea.

Publication information

BMC Genomics. 2025 Jul 7;26(1):641. doi: 10.1186/s12864-025-11787-4.

Abstract

BACKGROUND

Accurate pathogenicity prediction of rare variants in coding regions is crucial for prioritizing candidate variants in human diseases and advancing personalized precision medicine. Although many prediction methods have been developed, it remains unclear how they perform specifically on rare variants.

RESULTS

In this study, the performance of 28 pathogenicity prediction methods was assessed using the latest ClinVar dataset, with a focus on rare variants and on a range of allele frequency (AF) bins. Ten evaluation metrics were employed to comprehensively assess the predictive performance of each method. The methods were selected based on their training approaches, including whether the training dataset was filtered by AF and whether AF was incorporated as a feature. Most methods focused on missense and start-lost variants, covering only a subset of nonsynonymous SNVs. An average missing rate of approximately 10% was observed for these variants, meaning that prediction scores were unavailable for about one in ten of them. MetaRNN and ClinPred, which incorporate conservation, other prediction scores, and AFs as features, demonstrated the highest predictive power on rare variants. For most methods, specificity was lower than sensitivity. Across AF ranges, most performance metrics tended to decline as AF decreased, with specificity showing a particularly large decline.
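As a rough illustration of the kind of AF-stratified evaluation described above, the following Python sketch computes sensitivity, specificity, and missing rate per AF bin. It is not the authors' pipeline: the `Variant` record, the score threshold, the bin edges, and the convention that a higher score means "more likely pathogenic" are all assumptions made for this example, and only three of the ten metrics are shown because the abstract does not list the full set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    # Hypothetical record layout, chosen for illustration only.
    af: float               # allele frequency of the variant
    pathogenic: bool        # truth label (e.g. derived from ClinVar)
    score: Optional[float]  # predictor score; None if the method gives no score


def evaluate(variants, threshold):
    """Sensitivity, specificity, and missing rate for one group of variants.

    Assumes a score >= threshold is a 'pathogenic' call. Variants without
    a score count toward the missing rate and are excluded from the
    sensitivity and specificity calculations.
    """
    nan = float("nan")
    scored = [v for v in variants if v.score is not None]
    missing_rate = 1 - len(scored) / len(variants) if variants else nan

    tp = sum(v.pathogenic and v.score >= threshold for v in scored)
    fn = sum(v.pathogenic and v.score < threshold for v in scored)
    tn = sum(not v.pathogenic and v.score < threshold for v in scored)
    fp = sum(not v.pathogenic and v.score >= threshold for v in scored)

    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "missing_rate": missing_rate,
    }


def evaluate_by_af(variants, threshold, bin_edges):
    """Evaluate each half-open AF bin [lo, hi) defined by consecutive edges."""
    results = {}
    for lo, hi in zip(bin_edges, bin_edges[1:]):
        group = [v for v in variants if lo <= v.af < hi]
        results[(lo, hi)] = evaluate(group, threshold)
    return results
```

Calling `evaluate_by_af(variants, threshold=0.5, bin_edges=[0.0, 1e-4, 1e-3, 1e-2])` on a labelled variant list would return one metrics dictionary per bin, which makes it possible to check whether specificity falls as AF decreases. The threshold and bin edges here are placeholders; in practice each method has its own recommended cutoff.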

CONCLUSIONS

These results provide insights into the strengths and limitations of each method in predicting the pathogenicity of rare variants, which may guide future improvements in predictive models. Furthermore, while AF and existing prediction scores offer valuable information for prediction methods, the identification of novel biological features is essential to overcome current limitations and further improve predictive performance.

SUPPLEMENTARY INFORMATION

The online version contains supplementary material available at 10.1186/s12864-025-11787-4.

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/91fb/12235850/085d8a670ce9/12864_2025_11787_Fig1_HTML.jpg
