Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, 81377, Munich, Germany.
BMC Bioinformatics. 2022 Oct 5;23(1):412. doi: 10.1186/s12859-022-04962-x.
In the last few years, multi-omics data, that is, datasets containing different types of high-dimensional molecular variables for the same samples, have become increasingly available. To date, several comparison studies have focused on feature selection methods for omics data, but to our knowledge, none has compared these methods for the special case of multi-omics data. Given that these data have specific structures that differentiate them from single-omics data, it is unclear whether different feature selection strategies may be optimal for such data. In this paper, using 15 cancer multi-omics datasets, we compared four filter methods, two embedded methods, and two wrapper methods with respect to their performance in the prediction of a binary outcome in several situations that may affect the prediction results. As classifiers, we used support vector machines and random forests. The methods were compared using repeated fivefold cross-validation. The accuracy, the AUC, and the Brier score served as performance metrics.
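For readers who want to see the evaluation protocol in concrete terms, the following is a minimal sketch of the three performance metrics and of a repeated k-fold splitter, written in plain Python. It is an illustration only, not the code used in the study: the function names, the 0.5 classification threshold, the number of repeats, and the use of unstratified folds are all assumptions made for this example.

```python
import random


def accuracy(y_true, prob, threshold=0.5):
    """Share of samples whose thresholded prediction equals the 0/1 label.

    The 0.5 threshold is an assumption of this sketch.
    """
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, prob))
    return hits / len(y_true)


def brier_score(y_true, prob):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, prob)) / len(y_true)


def auc(y_true, prob):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly drawn positive sample receives
    a higher predicted probability than a randomly drawn negative sample,
    with ties counted as one half.
    """
    pos = [p for y, p in zip(y_true, prob) if y == 1]
    neg = [p for y, p in zip(y_true, prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def repeated_kfold(n_samples, k=5, repeats=3, seed=0):
    """Yield (train_indices, test_indices) for each fold of each repeat.

    Folds are not stratified by class; that simplification, and the
    default number of repeats, are assumptions of this sketch.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for i in range(k):
            test = sorted(folds[i])
            train = sorted(j for m, fold in enumerate(folds)
                           if m != i for j in fold)
            yield train, test
```

To avoid optimistic estimates, feature selection would be repeated on the training indices of every split, and the metrics computed only on the corresponding test indices. Higher accuracy and AUC indicate better performance, whereas a lower Brier score is better.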
The results suggested that, first, the chosen number of selected features affects the predictive performance for many feature selection methods but not all. Second, whether the features were selected by data type or from all data types concurrently did not considerably affect the predictive performance, but for some methods, concurrent selection took more time. Third, regardless of which performance measure was considered, the feature selection methods mRMR, the permutation importance of random forests, and the Lasso tended to outperform the other considered methods. Here, mRMR and the permutation importance of random forests already delivered strong predictive performance when considering only a few selected features. Finally, the wrapper methods were computationally much more expensive than the filter and embedded methods.
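The distinction between selecting features by data type and selecting from all data types concurrently can be sketched as follows. This is a hypothetical illustration: the feature names, the scores, and the rule of taking an equal number of features per data type are assumptions for the example and are not taken from the study.

```python
def select_concurrent(scores, k):
    """Pick the k highest-scoring features across all data types at once.

    scores: dict mapping feature name -> importance score (higher is better).
    """
    return sorted(scores, key=scores.get, reverse=True)[:k]


def select_by_type(scores, type_of, k_per_type):
    """Pick the k_per_type highest-scoring features within each data type.

    type_of: dict mapping feature name -> data type label (e.g. "mrna").
    Taking the same number per type is an assumption of this sketch.
    """
    by_type = {}
    for feature, data_type in type_of.items():
        by_type.setdefault(data_type, []).append(feature)
    selected = []
    for features in by_type.values():
        ranked = sorted(features, key=scores.get, reverse=True)
        selected.extend(ranked[:k_per_type])
    return selected


# Hypothetical scores for six features from three data types.
scores = {"mrna_1": 0.9, "mrna_2": 0.8, "mrna_3": 0.7,
          "mirna_1": 0.3, "mirna_2": 0.2, "cnv_1": 0.1}
type_of = {name: name.split("_")[0] for name in scores}

concurrent = select_concurrent(scores, k=3)
separate = select_by_type(scores, type_of, k_per_type=1)
```

In this toy example, concurrent selection returns only mRNA features because they have the highest scores, whereas selection by data type keeps one feature from each data type. Both calls select three features in total, which is the kind of like-for-like comparison the second finding refers to.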
We recommend the permutation importance of random forests and the filter method mRMR for feature selection using multi-omics data, although mRMR is considerably more computationally costly.