School of Information Technology, Monash University Malaysia, Subang Jaya, Selangor, Malaysia.
BMC Bioinformatics. 2022 Aug 7;23(1):325. doi: 10.1186/s12859-022-04870-0.
Malaria risk prediction is currently limited to advanced statistical methods, such as time series and cluster analysis of epidemiological data. Machine learning models have been explored to study the complexity of malaria through blood smear images and environmental data. However, to the best of our knowledge, no study has analysed the contribution of Single Nucleotide Polymorphisms (SNPs) to malaria using a machine learning model. This study therefore aims to quantify an individual's susceptibility to developing malaria using risk scores derived from the cumulative effects of SNPs, known as weighted genetic risk scores (wGRS).
We propose an SNP-based feature extraction algorithm that incorporates an individual's malaria susceptibility information to generate the feature set. Because learning from many SNPs can be computationally expensive for a machine learning model, we reduced the feature set with the Logistic Regression and Recursive Feature Elimination (LR-RFE) method, selecting the SNPs that improve the efficacy of our model. Next, we calculated the wGRS of the selected feature set, which serves as the model's target variable. As a comparison for the wGRS-only model, we also calculated and evaluated the combination of wGRS with genotype frequency (wGRS + GF). Finally, Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost), and Ridge regression algorithms were used to build the machine learning models for malaria risk prediction.
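To make the wGRS step concrete, here is a minimal sketch of one common formulation: each SNP contributes its risk-allele count (0, 1, or 2) multiplied by a weight, taken here as the natural log of the SNP's odds ratio. This weighting scheme is an assumption, since the abstract does not state how the weights were derived. The function name, the SNP labels `snp_a`, `snp_b`, `snp_c`, and the odds ratios are hypothetical values chosen only for illustration.

```python
import math


def weighted_genetic_risk_score(allele_counts, odds_ratios):
    """Return the sum over SNPs of ln(odds ratio) * risk-allele count.

    allele_counts: dict mapping SNP id -> number of risk alleles (0, 1, or 2)
    odds_ratios:   dict mapping SNP id -> odds ratio used as the weight source
    Assumption: weights are ln(OR); the study may use a different weighting.
    """
    score = 0.0
    for snp, count in allele_counts.items():
        if count not in (0, 1, 2):
            raise ValueError(f"{snp}: allele count must be 0, 1, or 2")
        score += math.log(odds_ratios[snp]) * count
    return score


# Hypothetical odds ratios and genotype for a single individual.
odds_ratios = {"snp_a": 0.5, "snp_b": 2.0, "snp_c": 1.0}
individual = {"snp_a": 1, "snp_b": 2, "snp_c": 0}

score = weighted_genetic_risk_score(individual, odds_ratios)
print(f"wGRS = {score:.4f}")
```

Under this formulation an odds ratio below 1 gives a negative weight (a protective allele lowers the score) and an odds ratio of exactly 1 contributes nothing.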
Our proposed approach identified SNP rs334 as the most important feature, with an importance score of 6.224 against a baseline importance score of 1.1314. This result is notable because prior studies have established rs334 as a major genetic risk factor for malaria. Comparing the three machine learning models showed that LightGBM achieved the best performance, with a Mean Absolute Error (MAE) of 0.0373. Furthermore, with wGRS + GF, all models performed significantly better than with wGRS alone, and LightGBM again performed best (MAE of 0.0033).
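For readers unfamiliar with the metric, MAE is the average absolute difference between predicted and observed values, so lower is better. The sketch below computes it on made-up numbers; the `observed` and `predicted` lists are hypothetical and are not taken from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |observed - predicted| over all samples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical observed and predicted risk scores for four individuals.
observed = [0.50, 1.25, 2.00, 0.75]
predicted = [0.75, 1.25, 1.50, 1.00]

mae = mean_absolute_error(observed, predicted)
print(f"MAE = {mae}")
```

Because MAE is expressed in the same units as the target, the reported values are in wGRS units and should be read relative to the range of the scores being predicted.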