Heo Jee Yeon, Kim Ju Han
Division of Biomedical Informatics, Seoul National University Biomedical Informatics (SNUBI), Seoul National University College of Medicine, Seoul, Korea.
Department of Neuropsychiatry, Seoul National University Hospital, Seoul, 03080, Korea.
Hum Genet. 2025 May 29. doi: 10.1007/s00439-025-02751-z.
Reliable prediction of pathogenic variants is central to personalized medicine, which aims to provide accurate diagnosis and individualized treatment through genomic medicine. This study introduces PRP, a pathogenic risk prediction tool for rare nonsynonymous single nucleotide variants (nsSNVs), including missense, start_lost, stop_gained, and stop_lost variants. PRP was designed to provide robust performance and interpretable predictions using thirty-four features across four categories: frequency, conservation scores, substitution metrics, and gene intolerance. Five machine-learning (ML) algorithms were compared to select the optimal model. Hyperparameter optimization was conducted using Optuna, and feature importance was analyzed using SHapley Additive exPlanations (SHAP). PRP was trained on ClinVar data, evaluated on three independent test datasets, and compared with twenty other prediction tools. PRP consistently outperformed state-of-the-art tools across all eight performance metrics: AUC, AUPRC, accuracy, F1-score, MCC, precision, recall, and specificity. In addition to achieving high sensitivity and high specificity without overestimating the number of pathogenic variants, PRP was robust in predicting rare variants. The datasets and code used for training and testing PRP, along with pre-computed scores, are available at https://github.com/DNAvigation/PRP .
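For readers unfamiliar with the threshold-based metrics listed in the abstract, the following minimal Python sketch shows their standard definitions in terms of confusion-matrix counts. It is not the PRP authors' code: the function name threshold_metrics and the toy labels are hypothetical, and AUC and AUPRC are omitted because they require continuous scores rather than binary calls. Recall is the same quantity as sensitivity.

```python
import math


def threshold_metrics(y_true, y_pred):
    """Return six threshold-based metrics for binary labels (1 = pathogenic, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is zero.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)        # also called sensitivity
    specificity = safe_div(tn, tn + fp)
    accuracy = safe_div(tp + tn, len(pairs))
    f1 = safe_div(2 * precision * recall, precision + recall)
    # Matthews correlation coefficient (MCC), which uses all four counts.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_den)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Toy example (hypothetical labels, not data from the study).
labels = [1, 1, 1, 0, 0, 0, 0, 1]
calls = [1, 1, 0, 0, 0, 1, 0, 1]
print(threshold_metrics(labels, calls))
```

Because MCC combines all four confusion-matrix counts, it is less easily inflated than accuracy when pathogenic and benign variants are imbalanced, which is one reason it is commonly reported alongside the other metrics.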