Benchmark study of feature selection strategies for multi-omics data.

Affiliation

Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, 81377, Munich, Germany.

Publication information

BMC Bioinformatics. 2022 Oct 5;23(1):412. doi: 10.1186/s12859-022-04962-x.

DOI:10.1186/s12859-022-04962-x

PMID:36199022

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9533501/

Abstract

BACKGROUND

In the last few years, multi-omics data, that is, datasets containing different types of high-dimensional molecular variables for the same samples, have become increasingly available. To date, several comparison studies focused on feature selection methods for omics data, but to our knowledge, none compared these methods for the special case of multi-omics data. Given that these data have specific structures that differentiate them from single-omics data, it is unclear whether different feature selection strategies may be optimal for such data. In this paper, using 15 cancer multi-omics datasets we compared four filter methods, two embedded methods, and two wrapper methods with respect to their performance in the prediction of a binary outcome in several situations that may affect the prediction results. As classifiers, we used support vector machines and random forests. The methods were compared using repeated fivefold cross-validation. The accuracy, the AUC, and the Brier score served as performance metrics.
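The evaluation protocol described above (repeated fivefold cross-validation scored by accuracy, AUC, and Brier score) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: the function names, the stratified fold assignment, the 0.5 classification threshold, and `repeats=3` are assumptions made for illustration, since the abstract does not state these details.

```python
import random
from collections import defaultdict

def accuracy(y_true, p_hat, threshold=0.5):
    """Share of samples whose thresholded probability matches the 0/1 label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_hat))
    return hits / len(y_true)

def brier_score(y_true, p_hat):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)

def auc(y_true, p_hat):
    """Chance that a random positive is ranked above a random negative (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def repeated_stratified_kfold(y, k=5, repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx); stratification is an assumption."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    for r in range(repeats):
        folds = [[] for _ in range(k)]
        # Deal each class round-robin into the folds after shuffling.
        for idx in by_class.values():
            shuffled = idx[:]
            rng.shuffle(shuffled)
            for pos, i in enumerate(shuffled):
                folds[pos % k].append(i)
        for f in range(k):
            test = sorted(folds[f])
            train = sorted(i for g in range(k) if g != f for i in folds[g])
            yield r, f, train, test

# Toy labels and predicted probabilities (hypothetical data).
y = [0, 1] * 10
p = [0.2 if label == 0 else 0.7 for label in y]
print(accuracy(y, p), auc(y, p), round(brier_score(y, p), 3))
print(sum(1 for _ in repeated_stratified_kfold(y)), "train/test splits")
```

In a real benchmark, feature selection would be rerun inside every training split so that the held-out fold never influences which features are kept.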

RESULTS

The results suggested that, first, the chosen number of selected features affects the predictive performance for many feature selection methods but not all. Second, whether the features were selected by data type or from all data types concurrently did not considerably affect the predictive performance, but for some methods, concurrent selection took more time. Third, regardless of which performance measure was considered, the feature selection methods mRMR, the permutation importance of random forests, and the Lasso tended to outperform the other considered methods. Here, mRMR and the permutation importance of random forests already delivered strong predictive performance when considering only a few selected features. Finally, the wrapper methods were computationally much more expensive than the filter and embedded methods.
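The distinction between selecting features by data type and selecting from all data types concurrently can be sketched as follows. This is an illustrative example under stated assumptions, not the study's procedure: the standardized mean difference used as the filter score, the even split of the feature budget across data types, and the block names `rna` and `mirna` are all hypothetical choices.

```python
from statistics import mean, pstdev

def relevance(column, y):
    """Absolute difference in class means, scaled by the overall spread."""
    g0 = [v for v, label in zip(column, y) if label == 0]
    g1 = [v for v, label in zip(column, y) if label == 1]
    spread = pstdev(column)
    return abs(mean(g1) - mean(g0)) / spread if spread > 0 else 0.0

def score_features(blocks, y):
    """Return {(block_name, feature_index): score} for every feature."""
    return {
        (name, j): relevance(col, y)
        for name, columns in blocks.items()
        for j, col in enumerate(columns)
    }

def select_concurrent(blocks, y, n_select):
    """Rank features from all data types together and keep the top n_select."""
    scores = score_features(blocks, y)
    return sorted(scores, key=scores.get, reverse=True)[:n_select]

def select_by_type(blocks, y, n_select):
    """Keep the top features within each data type (even split is an assumption)."""
    per_block = n_select // len(blocks)
    chosen = []
    for name, columns in blocks.items():
        scores = score_features({name: columns}, y)
        chosen += sorted(scores, key=scores.get, reverse=True)[:per_block]
    return chosen

# Hypothetical toy data: each block maps to a list of feature columns.
y = [0, 0, 0, 1, 1, 1]
blocks = {
    "rna": [[1, 2, 1, 9, 8, 9], [2, 1, 2, 7, 8, 7]],
    "mirna": [[3, 1, 2, 2, 3, 1], [0, 1, 0, 2, 3, 2]],
}
print(select_concurrent(blocks, y, 2))  # both picks come from "rna"
print(select_by_type(blocks, y, 2))     # one pick from each data type
```

On this toy data the concurrent strategy spends its whole budget on one data type, while the by-type strategy guarantees each data type a share.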

CONCLUSIONS

We recommend the permutation importance of random forests and the filter method mRMR for feature selection using multi-omics data, where, however, mRMR is considerably more computationally costly.
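To make the idea behind mRMR (minimum redundancy, maximum relevance) concrete, here is a minimal greedy sketch. It assumes absolute Pearson correlation as a stand-in for both relevance and redundancy and uses the difference form of the criterion; the implementation benchmarked in the paper may differ, and the names `pearson` and `mrmr` are illustrative only.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def mrmr(columns, y, n_select):
    """Greedy mRMR: maximise relevance minus mean redundancy with chosen features."""
    relevance = [abs(pearson(col, y)) for col in columns]
    # Start with the single most relevant feature.
    selected = [max(range(len(columns)), key=relevance.__getitem__)]
    while len(selected) < min(n_select, len(columns)):
        best, best_score = None, float("-inf")
        for j in range(len(columns)):
            if j in selected:
                continue
            redundancy = sum(
                abs(pearson(columns[j], columns[s])) for s in selected
            ) / len(selected)
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

# Hypothetical toy data: feature 1 nearly duplicates feature 0,
# feature 2 is weaker on its own but carries different information.
y = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 2],
    [0, 0, 1, 0, 1, 1, 0, 1],
]
print(mrmr(columns, y, 2))  # picks features 0 and 2, skipping the near-duplicate
```

Each added feature requires comparing every remaining candidate with everything already selected, which gives some intuition for why such a filter can become costly on high-dimensional data.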

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b776/9533501/79b2116c8572/12859_2022_4962_Fig1_HTML.jpg
