Dai Yifan, Zou Fei, Zou Baiming
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
bioRxiv. 2024 Oct 7:2024.10.04.616748. doi: 10.1101/2024.10.04.616748.
Omics features measured by high-throughput technologies and clinical features jointly influence many complex human diseases. Identifying key biomarkers and clinical risk factors is essential for understanding disease mechanisms and for advancing early diagnosis and precision medicine. However, the high dimensionality of omics profiles and their intricate associations with disease outcomes present significant analytical challenges. To address these challenges, we propose an ensemble, data-driven biomarker identification tool, Hybrid Feature Screening (HFS), which constructs a candidate feature set for downstream advanced machine learning models. The candidate features pre-screened by HFS are then refined with a computationally efficient permutation-based feature importance test; together, the two steps form the High-dimensional Feature Importance Test (HiFIT) framework. Through extensive numerical simulations and real-world applications, we demonstrate HiFIT's superior performance in both outcome prediction and feature importance identification. An R package implementing HiFIT is available on GitHub (https://github.com/BZou-lab/HiFIT).
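The two-stage idea described in the abstract (screen a high-dimensional feature set down to a few candidates, then assess the survivors by permutation) can be illustrated with a minimal Python sketch. This is not the HiFIT R package and does not reproduce its algorithm. The function names `screen_features` and `permutation_importance`, the screening score (the larger of the absolute Pearson and rank correlations), the least-squares model, and the simulated data are all assumptions chosen for illustration. The sketch also reports only a mean increase in test error per feature, not a formal significance test.

```python
import numpy as np


def screen_features(X, y, k):
    """Keep the k features with the largest marginal score.

    Hypothetical stand-in for a hybrid screen: the score is the larger of
    the absolute Pearson correlation and the absolute rank correlation.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    pearson = np.abs(Xc.T @ yc) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    )
    # Rank-transform each column, then correlate the ranks.
    Xr = np.argsort(np.argsort(X, axis=0), axis=0).astype(float)
    yr = np.argsort(np.argsort(y)).astype(float)
    Xr -= Xr.mean(axis=0)
    yr -= yr.mean()
    rank_corr = np.abs(Xr.T @ yr) / (
        np.linalg.norm(Xr, axis=0) * np.linalg.norm(yr) + 1e-12
    )
    score = np.maximum(pearson, rank_corr)
    return np.sort(np.argsort(score)[::-1][:k])


def permutation_importance(X_train, y_train, X_test, y_test, n_perm=50, seed=0):
    """Mean increase in test MSE when one feature column is shuffled.

    Assumption: an ordinary least-squares fit stands in for the downstream
    machine learning model.
    """
    rng = np.random.default_rng(seed)
    design = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(design, y_train, rcond=None)

    def mse(Xt):
        pred = np.column_stack([np.ones(len(Xt)), Xt]) @ beta
        return np.mean((y_test - pred) ** 2)

    baseline = mse(X_test)
    importance = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        increases = []
        for _ in range(n_perm):
            Xp = X_test.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break the link to y
            increases.append(mse(Xp) - baseline)
        importance[j] = np.mean(increases)
    return importance


# Simulated data (assumption): only features 0 and 1 affect the outcome.
rng = np.random.default_rng(42)
n, p = 300, 500
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=n)
train, test = np.arange(0, 200), np.arange(200, 300)

selected = screen_features(X[train], y[train], k=10)
imp = permutation_importance(
    X[train][:, selected], y[train], X[test][:, selected], y[test]
)
top_two = set(selected[np.argsort(imp)[::-1][:2]].tolist())
print("screened features:", selected.tolist())
print("two most important:", sorted(top_two))
```

Screening on the training split only, and measuring importance on held-out samples, keeps the importance estimate from rewarding features that were selected by chance on the same data.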