Supervised Relevance-Redundancy assessments for feature selection in omics-based classification scenarios.

Affiliation

Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci, 32, Milano, 20133, Italy.

Publication information

J Biomed Inform. 2023 Aug;144:104457. doi: 10.1016/j.jbi.2023.104457. Epub 2023 Jul 23.

Abstract

BACKGROUND AND OBJECTIVE

Many classification tasks in translational bioinformatics and genomics are characterized by high-dimensional feature spaces and an unbalanced distribution of samples among classes. These properties can reduce classifier robustness and increase the risk of overfitting, the curse of dimensionality, and generalization leaks; most importantly, they can prevent the adequate patient stratification that precision medicine requires when facing complex diseases such as cancer. A feature selection strategy that extracts only the predictive features, by removing irrelevant, redundant, and noisy ones, is therefore crucial to achieving valuable results on the desired task.

METHODS

We propose a new feature selection approach, called ReRa, based on supervised Relevance-Redundancy assessments. ReRa consists of a customized step of relevance-based filtering, to identify a reduced subset of meaningful features, followed by a supervised similarity-based procedure to minimize redundancy. This latter step innovatively uses a combination of global and class-specific similarity assessments to remove redundant features while preserving those differentiated across classes, even when these classes are strongly unbalanced.
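The two-step idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which relevance or similarity measures ReRa uses, so the sketch assumes an F-like between/within variance ratio for relevance and absolute Pearson correlation for similarity. The function names (`relevance_scores`, `abs_corr`, `select_features`) and the thresholds (`relevance_min`, `similarity_max`) are hypothetical, and the single greedy pass does not reproduce ReRa's re-evaluation of previously selected features.

```python
import numpy as np

def relevance_scores(X, y):
    """F-like score per feature: between-class over within-class sum of squares.
    (Assumed relevance measure, chosen only for illustration.)"""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return between / (within + 1e-12)

def abs_corr(a, b):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return 0.0 if denom == 0 else abs(a @ b) / denom

def select_features(X, y, relevance_min=0.5, similarity_max=0.9):
    """Relevance filter followed by a supervised redundancy check."""
    scores = relevance_scores(X, y)
    # Step 1: keep only sufficiently relevant features, most relevant first.
    candidates = [j for j in np.argsort(-scores) if scores[j] >= relevance_min]
    classes = np.unique(y)
    kept = []
    # Step 2: drop a candidate only if it is similar to an already kept
    # feature both globally AND within every class, so that features that
    # behave differently in some class (even a small one) are preserved.
    for j in candidates:
        redundant = False
        for k in kept:
            global_sim = abs_corr(X[:, j], X[:, k])
            class_sims = [abs_corr(X[y == c, j], X[y == c, k]) for c in classes]
            if global_sim > similarity_max and min(class_sims) > similarity_max:
                redundant = True
                break
        if not redundant:
            kept.append(int(j))
    return kept

# Toy usage on unbalanced synthetic data (40 vs 10 samples).
rng = np.random.default_rng(0)
y = np.array([0] * 40 + [1] * 10)
informative = 3.0 * y + rng.normal(size=50)
near_copy = informative + rng.normal(scale=0.01, size=50)  # redundant
noise = rng.normal(size=50)                                # irrelevant
X = np.column_stack([informative, near_copy, noise])
print(select_features(X, y))
```

Requiring high similarity within every class, rather than only globally, is one simple way to keep the minority class from being outvoted by the majority class when deciding whether two features are interchangeable.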

RESULTS

We compared ReRa with several existing feature selection methods, using each to obtain feature spaces on which to perform breast cancer patient subtyping with several classifiers; we considered two use cases based on gene or transcript isoform expression. In the vast majority of the assessed scenarios, performance with ReRa-selected feature spaces was significantly higher than with simple feature filtering, LASSO regularization, or even MRmr, another Relevance-Redundancy method. The two use cases are an insightful example of translational application: they take advantage of ReRa's capabilities to investigate and enhance a clinically relevant patient stratification task, and the approach could easily be applied to other cancer types and diseases.

CONCLUSIONS

The ReRa approach has the potential to improve the performance of machine learning models used in unbalanced classification scenarios. Compared with MRmr, another Relevance-Redundancy approach, ReRa does not require tuning the number of preserved features, ensures efficiency and scalability over huge initial dimensionalities, and allows all previously selected features to be re-evaluated at each iteration of the redundancy assessment, so that ultimately only the most relevant and class-differentiated features are preserved.
