Kabir Muhammad, Ahmed Saeed, Zhang Haoyang, Rodríguez-Rodríguez Ignacio, Najibi Seyed Morteza, Vihinen Mauno
Department of Experimental Medical Science, BMC B13, Lund University, SE-22184 Lund, Sweden.
Int J Mol Sci. 2025 Feb 25;26(5):2004. doi: 10.3390/ijms26052004.
Variation interpretation combines several types of information. Computational predictors, most often pathogenicity predictors, provide one of them, and these tools are built on a variety of algorithms. Although the guidelines of the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) classify variants into five categories, practically all pathogenicity predictors provide binary pathogenic/benign predictions. We developed a novel artificial intelligence-based tool, PON-P3, on the basis of a carefully selected training dataset, meticulous feature selection, and optimization. We started with 1526 features describing variations, their sequence and structural context, and parameters for the affected genes and proteins. The final random boosting method was tested and compared with a total of 23 predictors. PON-P3 performed better than recently introduced predictors that utilize large language models or structural predictions, and better than methods that use evolutionary data alone or in combination with different gene and protein properties. PON-P3 classifies cases into three categories: benign, pathogenic, and variants of uncertain significance (VUSs). When binary test data were used, some metapredictors performed slightly better than PON-P3; however, in real-life situations with patient data, those methods overpredict both pathogenic and benign cases. We used PON-P3 to predict all possible amino acid substitutions in all human proteins encoded by MANE transcripts. The method was also used to predict all unambiguous VUSs (i.e., those without conflicts) in ClinVar: 12.9% were predicted to be pathogenic and 49.9% benign.
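The difference between a binary and a three-category output can be illustrated with a minimal sketch. The abstract does not state how PON-P3 derives its three classes, so everything below is an assumption made for illustration: the score is taken to be a pathogenicity probability between 0 and 1, and the cutoffs `benign_max` and `pathogenic_min`, as well as the function names, are hypothetical and are not taken from PON-P3.

```python
from collections import Counter

LABELS = ("benign", "VUS", "pathogenic")


def classify(score: float, benign_max: float = 0.3, pathogenic_min: float = 0.7) -> str:
    """Map a score in [0, 1] to one of three classes.

    The cutoffs are hypothetical placeholders, not PON-P3 parameters.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score >= pathogenic_min:
        return "pathogenic"
    if score <= benign_max:
        return "benign"
    # Scores between the two cutoffs are left unresolved.
    return "VUS"


def class_fractions(scores: list[float]) -> dict[str, float]:
    """Return the fraction of scores that falls into each class."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = Counter(classify(s) for s in scores)
    return {label: counts[label] / len(scores) for label in LABELS}


if __name__ == "__main__":
    # Made-up scores, used only to show the call pattern.
    print(class_fractions([0.1, 0.2, 0.5, 0.9]))
```

A binary predictor corresponds to the special case where the two cutoffs coincide, so every variant is forced into the benign or the pathogenic class and none remains a VUS.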