Faculty of Biology and Biotechnology, HSE University, Moscow, Russia.
Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry RAS, Moscow, Russia.
PeerJ. 2022 Mar 30;10:e13200. doi: 10.7717/peerj.13200. eCollection 2022.
Feature selection is one of the main techniques used to prevent overfitting in machine learning applications. The most straightforward approach to feature selection is an exhaustive search: one can go over all possible feature combinations and pick the model with the highest accuracy. This method, together with its optimizations, has been actively used in biomedical research; however, a publicly available implementation is missing. We present ExhauFS, a user-friendly command-line implementation of the exhaustive search approach for classification and survival regression. Aside from the tool description, we include three application examples in the manuscript to comprehensively review the implemented functionality. First, we executed ExhauFS on a toy cervical cancer dataset to illustrate basic concepts. Then, multi-cohort microarray breast cancer datasets were used to construct gene signatures for 5-year recurrence classification. The vast majority of signatures constructed by ExhauFS passed the 0.65 threshold of sensitivity and specificity on all datasets, including the validation one. Moreover, a number of gene signatures demonstrated reliable performance on an independent RNA-seq dataset without any coefficient re-tuning, i.e., turned out to be cross-platform. Finally, Cox survival regression models were used to fit isomiR signatures for overall survival prediction for patients with colorectal cancer. Similarly to the previous example, the majority of models passed the pre-defined concordance index threshold of 0.65 on all datasets. In both real-world scenarios (breast and colorectal cancer datasets), ExhauFS was benchmarked against state-of-the-art feature selection models, including L1-regularized sparse models. In the case of breast cancer, we were unable to construct reliable cross-platform classifiers using alternative feature selection approaches. In the case of colorectal cancer, not a single alternative model passed the same 0.65 threshold.
Source code and documentation of ExhauFS are available on GitHub: https://github.com/s-a-nersisyan/ExhauFS.
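The exhaustive search idea described in the abstract can be illustrated with a minimal Python sketch. This is not the ExhauFS implementation or its interface: the nearest-centroid classifier, the function names (`nearest_centroid_predict`, `sensitivity_specificity`, `exhaustive_search`) and the toy data are all assumptions made purely for illustration. The sketch enumerates every combination of k features, fits a simple classifier on the training set, and keeps only the combinations whose sensitivity and specificity both reach the 0.65 threshold on every dataset, including the validation one.

```python
from itertools import combinations


def nearest_centroid_predict(train_X, train_y, X, features):
    """Fit a nearest-centroid classifier on the chosen features, then label X.

    Hypothetical stand-in classifier; assumes both classes occur in train_y.
    """
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]

    def sq_dist(row, label):
        return sum((row[f] - c) ** 2 for f, c in zip(features, centroids[label]))

    return [1 if sq_dist(row, 1) < sq_dist(row, 0) else 0 for row in X]


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity); assumes both classes occur in y_true."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, tn / negatives


def exhaustive_search(datasets, train_name, k, threshold=0.65):
    """Return all k-feature combinations that pass the threshold on every dataset."""
    train_X, train_y = datasets[train_name]
    n_features = len(train_X[0])
    passed = []
    for features in combinations(range(n_features), k):
        ok = True
        for X, y in datasets.values():
            y_pred = nearest_centroid_predict(train_X, train_y, X, features)
            sens, spec = sensitivity_specificity(y, y_pred)
            if min(sens, spec) < threshold:
                ok = False
                break
        if ok:
            passed.append(features)
    return passed


# Toy data (invented for this sketch): features 0 and 1 separate the classes,
# feature 2 is noise that happens to pass on the training set only.
datasets = {
    "train": (
        [[0.1, 0.2, 0.9], [0.2, 0.1, 0.1], [0.0, 0.3, 0.5],
         [0.9, 0.8, 0.8], [0.8, 0.9, 0.2], [1.0, 0.7, 0.4]],
        [0, 0, 0, 1, 1, 1],
    ),
    "validation": (
        [[0.2, 0.3, 0.1], [0.1, 0.1, 0.8], [0.9, 0.9, 0.9], [0.7, 0.8, 0.0]],
        [0, 0, 1, 1],
    ),
}

single_feature_models = exhaustive_search(datasets, "train", k=1)
print(single_feature_models)  # [(0,), (1,)]
```

In this toy example the noise feature reaches 2/3 sensitivity and specificity on the training set but only 0.5 specificity on the validation set, so it is filtered out. The number of combinations grows as "n choose k", which is why practical use of exhaustive search typically relies on a pre-selection step that reduces n before enumeration.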