A Comprehensive Evaluation Framework for Benchmarking Multi-Objective Feature Selection in Omics-Based Biomarker Discovery.

Author information

Cattelani Luca, Ghosh Arindam, Rintala Teemu J, Fortino Vittorio

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2024 Nov-Dec;21(6):2432-2446. doi: 10.1109/TCBB.2024.3480150. Epub 2024 Dec 10.

Abstract

Machine learning algorithms have been extensively used for accurate classification of cancer subtypes driven by gene expression-based biomarkers. However, biomarker models combining multiple gene expression signatures are often not reproducible in external validation datasets, and their feature set size is often not optimized, which jeopardizes their translation into cost-effective clinical tools. We investigated how to solve the multi-objective problem of finding the best trade-offs between classification performance and feature set size by applying seven algorithms for machine learning-driven feature subset selection, and analysed how they perform in a benchmark with eight large-scale cancer transcriptome datasets, covering both training and external validation sets. The benchmark includes evaluation metrics that assess the performance of the individual biomarkers and of the solution sets in terms of accuracy, diversity, and stability of the constituent genes. Moreover, a new evaluation metric for cross-validation studies is proposed that generalizes the hypervolume, which is commonly used to assess the performance of multi-objective optimization algorithms. Biomarkers achieving a balanced accuracy of 0.8 on the external dataset for breast, kidney, and ovarian cancer were obtained using 4, 2, and 7 features, respectively. Genetic algorithms often performed better than the other algorithms considered, and the recently proposed NSGA2-CH and NSGA2-CHS were the best-performing methods in most cases.
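To make the hypervolume idea concrete, the following is a minimal sketch of the standard two-objective hypervolume for the accuracy versus feature-set-size trade-off. It is not the cross-validation generalization proposed in the paper, whose definition is not given in the abstract. The function names, the normalization of set size into a "compactness" score, the `max_features` cap, the reference point (0, 0), and the candidate values are all assumptions made for illustration only.

```python
from typing import List, Tuple

# Both objectives are maximized: (balanced accuracy, compactness).
Point = Tuple[float, float]


def to_objectives(balanced_accuracy: float, n_features: int,
                  max_features: int) -> Point:
    """Map a candidate biomarker to two maximized objectives.

    Compactness = 1 - n_features / max_features is an assumed normalization,
    chosen here only so that smaller feature sets score higher.
    """
    return (balanced_accuracy, 1.0 - n_features / max_features)


def pareto_front(points: List[Point]) -> List[Point]:
    """Return non-dominated points, sorted by decreasing first objective."""
    front: List[Point] = []
    best_second = float("-inf")
    for p in sorted(set(points), key=lambda q: (-q[0], -q[1])):
        # A point survives only if it beats every better-or-equal-accuracy
        # point on the second objective.
        if p[1] > best_second:
            front.append(p)
            best_second = p[1]
    return front


def hypervolume_2d(points: List[Point],
                   ref: Point = (0.0, 0.0)) -> float:
    """Area dominated by the front and bounded below by the reference point."""
    inside = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    volume = 0.0
    prev_second = ref[1]
    # Along the front the first objective decreases while the second
    # increases, so each point adds one horizontal strip of new area.
    for first, second in pareto_front(inside):
        volume += (first - ref[0]) * (second - prev_second)
        prev_second = second
    return volume


if __name__ == "__main__":
    # Hypothetical candidates: (balanced accuracy, number of features).
    candidates = [(0.80, 4), (0.75, 2), (0.70, 7)]
    objs = [to_objectives(acc, k, max_features=10) for acc, k in candidates]
    print(pareto_front(objs))
    print(round(hypervolume_2d(objs), 4))
```

A larger hypervolume indicates a solution set that covers better trade-offs overall, which is why it is a common single-number summary when comparing multi-objective feature selection algorithms.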
