Optimizing hybrid ensemble feature selection strategies for transcriptomic biomarker discovery in complex diseases.

Authors

Claude Elsa, Leclercq Mickaël, Thébault Patricia, Droit Arnaud, Uricaru Raluca

Affiliations

Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, F-33400 Talence, France.

Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada.

Publication information

NAR Genom Bioinform. 2024 Jul 11;6(3):lqae079. doi: 10.1093/nargab/lqae079. eCollection 2024 Sep.

Abstract

Biomedical research takes advantage of omic data, such as transcriptomics, to unravel the complexity of diseases. A conventional strategy relies on feature selection approaches to identify transcriptomic biomarkers, that is, genes whose expression patterns are associated with a phenotype. Hybrid ensemble feature selection (HEFS) has become increasingly popular because it ensures robustness of the selected features by performing data and functional perturbations. However, it remains difficult to make the best-suited choices at each step when designing such approaches. We conducted an extensive analysis of four possible HEFS scenarios for the identification of Stage IV colorectal, Stage I kidney and lung, and Stage III endometrial cancer biomarkers from transcriptomic data. These scenarios combine two types of feature reduction by filters (differentially expressed genes and variance) with two types of resampling strategies (repeated holdout with distribution-balanced stratified or random stratified sampling) for downstream feature selection through an aggregation of thousands of wrapped machine learning models. Based on our results, we emphasize the advantages of using HEFS approaches to identify complex disease biomarkers, given their ability to produce results that generalize and remain stable under both data and functional perturbations. Finally, we highlight critical issues that need to be considered in the design of such strategies.
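
The general shape of a HEFS pipeline as described in the abstract (filter, resample, select with several methods, aggregate) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes binary labels coded 0/1, uses a variance filter, draws repeated random stratified training subsets as the data perturbation, and substitutes two simple univariate rankers for the wrapped machine learning models as the functional perturbation. All function names, parameters, and the synthetic data are hypothetical, and the held-out samples are not used for model evaluation here.

```python
import random
import statistics
from collections import Counter


def variance_filter(X, keep):
    """Return indices of the `keep` features with the highest variance."""
    n_features = len(X[0])
    variances = [statistics.pvariance([row[j] for row in X]) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: variances[j], reverse=True)[:keep]


def stratified_holdout(y, train_fraction, rng):
    """Draw one random stratified training subset (row indices), preserving class ratios."""
    train = []
    for label in sorted(set(y)):
        members = [i for i, value in enumerate(y) if value == label]
        rng.shuffle(members)
        train.extend(members[: max(2, round(train_fraction * len(members)))])
    return train


def mean_difference_score(X, y, rows, j):
    """Stand-in ranker 1: absolute difference between the class means of feature j."""
    a = [X[i][j] for i in rows if y[i] == 0]
    b = [X[i][j] for i in rows if y[i] == 1]
    return abs(statistics.mean(a) - statistics.mean(b))


def welch_t_score(X, y, rows, j):
    """Stand-in ranker 2: absolute Welch t statistic of feature j."""
    a = [X[i][j] for i in rows if y[i] == 0]
    b = [X[i][j] for i in rows if y[i] == 1]
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return abs(statistics.mean(a) - statistics.mean(b)) / (se + 1e-12)


def hefs_sketch(X, y, n_filtered=20, n_splits=30, top_k=5, seed=0):
    """Return the selection frequency of each feature over all splits and rankers."""
    rng = random.Random(seed)
    candidates = variance_filter(X, n_filtered)  # feature reduction by filter
    rankers = [mean_difference_score, welch_t_score]
    votes = Counter()
    for _ in range(n_splits):  # data perturbation: a new training subset each round
        rows = stratified_holdout(y, 0.8, rng)
        for ranker in rankers:  # functional perturbation: different selection methods
            ranked = sorted(candidates, key=lambda j: ranker(X, y, rows, j), reverse=True)
            votes.update(ranked[:top_k])
    total = n_splits * len(rankers)
    return {j: count / total for j, count in votes.most_common()}


# Synthetic data (hypothetical): 40 samples, 50 features, only features 0 and 1 carry signal.
data_rng = random.Random(42)
y = [0] * 20 + [1] * 20
X = [[data_rng.gauss(0, 1) for _ in range(50)] for _ in y]
for i, label in enumerate(y):
    if label == 1:
        X[i][0] += 3.0
        X[i][1] -= 3.0

frequencies = hefs_sketch(X, y)
stable = [j for j, f in frequencies.items() if f >= 0.9]
print(stable)
```

Features that are selected in nearly every split by every ranker end up with a frequency close to 1, which is the sense in which an ensemble of this kind favors features that are stable under both kinds of perturbation. The 0.9 threshold and the top-5 cutoff are arbitrary choices made for this illustration.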

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4610/11237901/6709ec750df4/lqae079figgra1.jpg
