Ka Hamed, Naghinejad Maryam, Amirfiroozy Akbar, Shamsir Mohd Shahir, Parvizpour Sepideh, Razmara Jafar
Department of Computer Science, Faculty of Mathematics, Statistics, and Computer Science, University of Tabriz, Tabriz, Iran.
Department of Medical Genetics, Faculty of Medicine, Tabriz University of Medical Sciences, Tabriz, Iran.
J Hum Genet. 2025 Apr 18. doi: 10.1038/s10038-025-01341-1.
Correct classification of variants is the key to pre-symptomatic detection of disease and to taking preventive action. Because BRCA1 has high incidence and penetrance in breast and ovarian cancers, a high-performance predictive tool can be employed to classify the clinical significance of its variants. Several tools have been developed for this purpose, but they classify clinical significance poorly in specific cases, and they commonly assign a score without providing any interpretation of it. To obtain an accurate predictive tool with interpretation abilities, in this study we propose BRCA1-Forest, which is based on random forest, a well-known machine learning technique, and makes interpretable decisions with high specificity and sensitivity in variant classification. The method narrows down the available options until it reaches the final decision. To this end, a set of benign and pathogenic BRCA1 missense variants was first collected, and the dataset was then prepared based on the effect of each variant on the protein sequence. The dataset was enriched by adding physicochemical changes and the conservation score of the amino acid position as pathogenicity criteria. The proposed model was trained on this dataset to classify the clinical significance of variants. The performance of BRCA1-Forest was compared with four state-of-the-art methods, SIFT, PolyPhen2, CADD, and DANN, using several evaluation metrics: precision, recall, false positive rate (FPR), the area under the receiver operating characteristic curve (AUC-ROC), the area under the precision-recall curve (AUC-PR), and the Matthews correlation coefficient (MCC). The results show that the proposed model outperforms these tools on all metrics except recall. The BRCA1-Forest software is available at https://github.com/HamedKAAC/BRCA1Forest .
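For readers unfamiliar with the threshold-based metrics listed above, the following minimal sketch shows how precision, recall, FPR, and MCC follow from a binary confusion matrix. It is an illustration under stated assumptions, not code from BRCA1-Forest: the function name `binary_metrics`, the label encoding (1 = pathogenic, 0 = benign), and the toy labels are all hypothetical. AUC-ROC and AUC-PR are omitted because they require predicted scores rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return precision, recall, FPR and MCC for binary labels.

    Assumed encoding (hypothetical): 1 = pathogenic, 0 = benign.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Each ratio falls back to 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0

    # Matthews correlation coefficient from the four confusion-matrix counts.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"precision": precision, "recall": recall, "fpr": fpr, "mcc": mcc}


# Toy labels for illustration only; these are not data from the study.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]
toy_result = binary_metrics(toy_true, toy_pred)
print(toy_result)
```

Unlike precision or recall taken alone, MCC uses all four confusion-matrix counts, which makes it informative when benign and pathogenic classes are imbalanced.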