Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, Milan 20133, Italy.
Bioinformatics. 2024 Oct 1;40(10). doi: 10.1093/bioinformatics/btae605.
Feature selection approaches are widely used in gene expression data analysis to identify the most relevant features and boost performance in regression and classification tasks. However, such algorithms consider only each feature's quantitative contribution to the task, which can limit the biological interpretability of the results. Feature-related prior knowledge, such as functional annotations and pathway information, can be incorporated into feature selection algorithms to potentially improve both model performance and interpretability.
We propose an embedded integrative approach to feature selection that combines weighted LASSO feature selection and prior biological knowledge in a single step, by means of a novel score of biological relevance that summarizes information extracted from popular biological knowledge bases. Our experiments indicate that the proposed approach identifies the most predictive genes while enhancing the biological interpretability of the results compared to the standard LASSO regularized model.
Code is available at https://github.com/DEIB-GECO/GIS-weigthed_LASSO.
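As a rough illustration of the weighted LASSO idea described above, the sketch below minimizes (1/2n)·||y − Xβ||² + λ·Σ_j w_j·|β_j| by coordinate descent in plain Python. It is not taken from the authors' repository. The function names, the toy data, and in particular the mapping from a biological relevance score to a penalty weight (here simply w_j = 1 − score_j) are assumptions made for illustration; the abstract does not specify how the score enters the penalty.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used by LASSO coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, penalty_weights, lam, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + lam * sum_j w_j * |b_j|.

    X is a list of rows, y a list of targets, penalty_weights the
    per-feature weights w_j (a smaller weight means a weaker penalty).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant-zero column carries no information
            # Correlation of feature j with the partial residual.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n
            rho += col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam * penalty_weights[j]) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Toy data: two genes with identical expression profiles, so the data
# alone cannot tell them apart.
X = [[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]]
y = [2.0, -2.0, 2.0, -2.0]

# Hypothetical biological relevance scores in [0, 1]; the second gene is
# assumed to be better supported by prior knowledge.
relevance = [0.0, 0.8]
# Assumed mapping: higher relevance gives a lower penalty weight.
weights = [1.0 - s for s in relevance]

beta = weighted_lasso(X, y, weights, lam=0.5)
print(beta)  # the first gene is shrunk to zero, the second is kept
```

With uniform weights the two genes would be interchangeable; the lower penalty on the second gene is what steers the selection toward it, which is the intuition behind letting prior knowledge shape the penalty.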