Network Science Institute, Northeastern University, Boston, MA 02115, United States.
Scipher Medicine, Waltham, MA 02453, United States.
Bioinformatics. 2024 Jun 3;40(6). doi: 10.1093/bioinformatics/btae341.
A major obstacle to using machine learning (ML) on medical datasets is the mismatch between a large number of variables and small sample sizes. Multiple feature selection techniques have been proposed to avoid the resulting overfitting, and ensemble techniques generally offer the most robust selection. Yet current methods designed to combine different algorithms generally fail to leverage the dependencies identified by their components. Here, we propose Graphical Ensembling (GE), a graph-theory-based ensemble feature selection technique designed to improve the stability and relevance of the selected features.
Using four datasets, we show that GE increases classification performance with fewer selected features. For example, on rheumatoid arthritis patient stratification, GE outperforms the baseline methods by 9% in balanced accuracy while relying on fewer features. We use data on sub-cellular networks to show that the selected features (proteins) are closer to the known disease genes and that the uncovered biological mechanisms are more diverse. Because GE successfully tackles the complex correlations between biological variables, we anticipate that it will improve the medical applications of ML.
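For readers unfamiliar with the metric reported above, balanced accuracy is the mean of the per-class recalls, which makes it less sensitive to class imbalance than plain accuracy. The sketch below is illustrative only: the function names and the toy labels are assumptions made for this example and are not taken from the paper or its data.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall (illustrative helper, not from the paper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    """Return the plain fraction of correct predictions, for comparison."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels: 8 samples of class 0 and 2 of class 1.
y_true = [0] * 8 + [1] * 2
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 10

print(accuracy(y_true, y_pred))           # plain accuracy
print(balanced_accuracy(y_true, y_pred))  # mean of per-class recall
```

On this toy input the majority-class predictor gets 8 of 10 samples right, but its recall is 1.0 on class 0 and 0.0 on class 1, so its balanced accuracy is only 0.5.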