A Comprehensive Evaluation Framework for Benchmarking Multi-Objective Feature Selection in Omics-Based Biomarker Discovery.

Authors

Cattelani Luca, Ghosh Arindam, Rintala Teemu J, Fortino Vittorio

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2024 Nov-Dec;21(6):2432-2446. doi: 10.1109/TCBB.2024.3480150. Epub 2024 Dec 10.

DOI:10.1109/TCBB.2024.3480150

Abstract

Machine learning algorithms have been extensively used for accurate classification of cancer subtypes driven by gene expression-based biomarkers. However, biomarker models combining multiple gene expression signatures are often not reproducible in external validation datasets, and their feature set size is often not optimized, jeopardizing their translation into cost-effective clinical tools. We investigated how to solve the multi-objective problem of finding the best trade-offs between classification performance and feature set size by applying seven algorithms for machine learning-driven feature subset selection, and analysed how they perform in a benchmark with eight large-scale cancer transcriptome datasets, covering both training and external validation sets. The benchmark includes evaluation metrics that assess the performance of the individual biomarkers and of the solution sets according to their accuracy, diversity, and stability of the constituent genes. Moreover, we propose a new evaluation metric for cross-validation studies that generalizes the hypervolume, which is commonly used to assess the performance of multi-objective optimization algorithms. Biomarkers with a balanced accuracy of 0.8 on the external dataset were obtained for breast, kidney and ovarian cancer using 4, 2 and 7 features, respectively. Genetic algorithms often outperformed the other algorithms considered, and the recently proposed NSGA2-CH and NSGA2-CHS were the best-performing methods in most cases.
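
The trade-off described in the abstract can be illustrated with a minimal sketch of the standard two-objective hypervolume, here taken as the area dominated by a set of (balanced accuracy, feature count) solutions relative to a reference point. This is not the generalized cross-validation metric proposed in the paper, whose definition is not given in the abstract. The function names, the reference point, and the example values are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# A candidate biomarker is summarized as (balanced accuracy, number of features).
# Assumption: accuracy is maximized and the number of features is minimized.
Solution = Tuple[float, int]


def pareto_front(solutions: List[Solution]) -> List[Solution]:
    """Return the non-dominated solutions, sorted by increasing feature count."""
    front = []
    for acc, size in solutions:
        dominated = any(
            (a >= acc and s <= size) and (a > acc or s < size)
            for a, s in solutions
        )
        if not dominated:
            front.append((acc, size))
    return sorted(set(front), key=lambda p: p[1])


def hypervolume_2d(
    solutions: List[Solution], ref_size: int, ref_accuracy: float = 0.0
) -> float:
    """Area dominated by the front relative to (ref_accuracy, ref_size).

    Solutions at or beyond the reference point contribute nothing.
    """
    front = [
        (a, s)
        for a, s in pareto_front(solutions)
        if s < ref_size and a > ref_accuracy
    ]
    area = 0.0
    for i, (acc, size) in enumerate(front):
        # On a front sorted by size, accuracy increases with size, so each
        # solution covers the strip up to the next (larger) solution's size.
        next_size = front[i + 1][1] if i + 1 < len(front) else ref_size
        area += (next_size - size) * (acc - ref_accuracy)
    return area


# Hypothetical values, not results from the paper.
candidates = [(0.70, 2), (0.80, 4), (0.75, 5), (0.85, 7)]
print(pareto_front(candidates))
print(hypervolume_2d(candidates, ref_size=10))
```

In this example the candidate (0.75, 5) is dropped from the front because (0.80, 4) is both more accurate and smaller. A larger hypervolume indicates a solution set that offers better accuracy at smaller feature counts.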

Similar articles

Triple and quadruple optimization for feature selection in cancer biomarker discovery.

J Biomed Inform. 2024 Nov;159:104736. doi: 10.1016/j.jbi.2024.104736. Epub 2024 Oct 11.

Stabilizing machine learning for reproducible and explainable results: A novel validation approach to subject-specific insights.

Comput Methods Programs Biomed. 2025 Jun 21;269:108899. doi: 10.1016/j.cmpb.2025.108899.

Comparison of Two Modern Survival Prediction Tools, SORG-MLA and METSSS, in Patients With Symptomatic Long-bone Metastases Who Underwent Local Treatment With Surgery Followed by Radiotherapy and With Radiotherapy Alone.

Clin Orthop Relat Res. 2024 Dec 1;482(12):2193-2208. doi: 10.1097/CORR.0000000000003185. Epub 2024 Jul 23.

Molecular feature-based classification of retroperitoneal liposarcoma: a prospective cohort study.

Elife. 2025 May 23;14:RP100887. doi: 10.7554/eLife.100887.

Deciphering Shared Gene Signatures and Immune Infiltration Characteristics Between Gestational Diabetes Mellitus and Preeclampsia by Integrated Bioinformatics Analysis and Machine Learning.

Reprod Sci. 2025 May 15. doi: 10.1007/s43032-025-01847-1.

Classification of finger movements through optimal EEG channel and feature selection.

Front Hum Neurosci. 2025 Jul 16;19:1633910. doi: 10.3389/fnhum.2025.1633910. eCollection 2025.

A Responsible Framework for Assessing, Selecting, and Explaining Machine Learning Models in Cardiovascular Disease Outcomes Among People With Type 2 Diabetes: Methodology and Validation Study.

JMIR Med Inform. 2025 Jun 27;13:e66200. doi: 10.2196/66200.

Integrated multi-omics analysis and machine learning identify G protein-coupled receptor-related signatures for diagnosis and clinical benefits in soft tissue sarcoma.

Front Immunol. 2025 Jul 21;16:1561227. doi: 10.3389/fimmu.2025.1561227. eCollection 2025.

XGB-BIF: An XGBoost-Driven Biomarker Identification Framework for Detecting Cancer Using Human Genomic Data.

Int J Mol Sci. 2025 Jun 11;26(12):5590. doi: 10.3390/ijms26125590.

Cited by

Dual-stage optimizer for systematic overestimation adjustment applied to multi-objective genetic algorithms for biomarker selection.

Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae674.
