Assessing Random Forest self-reproducibility for optimal short biomarker signature discovery.

Author information

Debit Ahmed, Poulet Christophe, Josse Claire, Jerusalem Guy, Azencott Chloe-Agathe, Bours Vincent, Van Steen Kristel

Affiliations

Laboratory of Human Genetics, GIGA Institute, University of Liege (ULiege), Avenue Hippocrate 1/11, 4000 Liege, Belgium.

BIO3, GIGA Institute, University of Liege (ULiege), Avenue Hippocrate 1/11, 4000 Liege, Belgium.

Publication information

Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf318.

Abstract

Biomarker signature discovery remains the main path to developing clinical diagnostic tools when biological knowledge of the pathology is weak. The shortest signatures are often preferred because they reduce the cost of the diagnostic. The ability to find the best and shortest signature relies on the robustness of the models that can be built on such a set of molecules. The classification algorithm is often selected based on the average Area Under the Curve (AUC) performance of its models. However, an algorithm with a large AUC distribution is not guaranteed to keep a stable performance when facing data. Here, we propose two AUC-derived hyper-stability scores, the Hyper-stability Resampling Sensitive (HRS) and the Hyper-stability Signature Sensitive (HSS), as complementary metrics to the average AUC that should bring confidence in the choice of the best classification algorithm. To emphasize the importance of these scores, we compared 15 different Random Forest implementations. Our findings show that the Random Forest implementation should be chosen according to the data at hand and the classification question being evaluated. No Random Forest implementation can be used universally for any classification question and on any dataset. Each should be tested for its average AUC performance and AUC-derived stability prior to analysis.
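
The point that an average AUC says little about stability can be illustrated with a minimal sketch. The abstract does not define HRS or HSS, so the code below does not implement them; it only assumes a generic stand-in, the standard deviation of the AUC across bootstrap resamples. The helper names (`auc`, `resampled_aucs`, `summarize`) and the synthetic scores are hypothetical and serve illustration only.

```python
import random
import statistics


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def resampled_aucs(labels, scores, n_resamples=200, seed=0):
    """Compute the AUC on bootstrap resamples of the (label, score) pairs."""
    rng = random.Random(seed)
    pairs = list(zip(labels, scores))
    aucs = []
    while len(aucs) < n_resamples:
        sample = [rng.choice(pairs) for _ in pairs]
        ys = [y for y, _ in sample]
        if 0 < sum(ys) < len(ys):  # skip resamples that lost one of the classes
            aucs.append(auc(ys, [s for _, s in sample]))
    return aucs


def summarize(aucs):
    """Return the mean and the standard deviation of a list of AUC values."""
    return statistics.mean(aucs), statistics.stdev(aucs)


# Synthetic classifier scores (hypothetical data, not taken from the paper).
rng = random.Random(42)
labels = [1] * 30 + [0] * 30
scores = [rng.gauss(1.0, 1.0) for _ in range(30)] + [rng.gauss(0.0, 1.0) for _ in range(30)]

mean_auc, sd_auc = summarize(resampled_aucs(labels, scores))
print(f"mean AUC = {mean_auc:.3f}, SD across resamples = {sd_auc:.3f}")
```

Two algorithms with similar mean AUC can differ markedly in this spread, which is the kind of information a stability score is meant to add alongside the average.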
