Caliskan Deniz, Caliskan Aylin, Dandekar Thomas, Breitenbach Tim
Department of Bioinformatics, Biocenter, University of Würzburg, Am Hubland, Würzburg D-97074, Germany.
Comput Struct Biotechnol J. 2025 Aug 5;27:3510-3527. doi: 10.1016/j.csbj.2025.07.047. eCollection 2025.
Identifying biologically meaningful gene sets and evaluating how well they separate conditions based on gene expression is an important step in many transcriptomic analyses. While most workflows support data-driven feature selection, few allow direct evaluation of predefined gene sets in a classification context, which limits the ability to assess literature-derived panels or biologically motivated hypotheses prior to downstream analysis. To address this gap, we developed gSELECT, a Python library for evaluating the classification performance of both automatically ranked and user-defined gene sets. It operates on .csv or .h5ad expression matrices with group labels and can be easily integrated into existing analysis pipelines. Gene selection can be based on mutual information ranking, random sampling, or custom input. Supplying a custom gene set supports hypothesis-driven testing without data-derived selection bias and allows direct evaluation of known or candidate markers. Classification is performed using multilayer perceptrons with Monte Carlo cross-validation, either on the full dataset or with a user-defined train/test split. Exhaustive and greedy search strategies are available to explore combinatorial effects among genes and to identify minimal gene combinations with high predictive power. gSELECT is intended as a pre-analysis tool for evaluating dataset separability and for supporting early assessment of candidate genes before committing to resource-intensive downstream analyses.
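The evaluation scheme described above (mutual information ranking or a user-defined gene set, scored by a multilayer perceptron under Monte Carlo cross-validation) can be illustrated with a minimal sketch. This is not the gSELECT API: it uses scikit-learn, the function name `mc_cv_accuracy` and all of its parameters are hypothetical, and the expression matrix is synthetic, with genes 0 and 1 made informative by construction. Monte Carlo cross-validation is implemented here as repeated stratified random train/test splits, and the mutual information ranking is computed on the training portion of each split only, so that the ranking itself does not see the test cells.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def mc_cv_accuracy(X, y, genes=None, top_k=2, n_splits=5, test_size=0.3, seed=0):
    """Mean test accuracy over repeated random splits (Monte Carlo CV).

    Hypothetical helper, not part of gSELECT.
    genes: column indices of a predefined gene set; if None, the top_k
           genes are ranked by mutual information on each training split.
    """
    splitter = StratifiedShuffleSplit(
        n_splits=n_splits, test_size=test_size, random_state=seed
    )
    scores = []
    for train, test in splitter.split(X, y):
        if genes is None:
            # Data-driven selection: rank on training cells only.
            mi = mutual_info_classif(X[train], y[train], random_state=seed)
            cols = np.argsort(mi)[::-1][:top_k]
        else:
            # Hypothesis-driven selection: fixed, user-supplied gene set.
            cols = np.asarray(genes)
        model = make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=seed),
        )
        model.fit(X[train][:, cols], y[train])
        scores.append(model.score(X[test][:, cols], y[test]))
    return float(np.mean(scores))


# Synthetic data (assumption): 200 cells x 30 genes, two balanced groups.
rng = np.random.default_rng(0)
n_cells, n_genes = 200, 30
y = np.repeat([0, 1], n_cells // 2)
X = rng.normal(size=(n_cells, n_genes))
X[y == 1, :2] += 3.0  # genes 0 and 1 separate the groups

random_genes = rng.choice(np.arange(2, n_genes), size=2, replace=False)

acc_known = mc_cv_accuracy(X, y, genes=[0, 1])        # predefined marker set
acc_random = mc_cv_accuracy(X, y, genes=random_genes)  # random baseline
acc_ranked = mc_cv_accuracy(X, y)                      # MI-ranked top 2

print(f"predefined set: {acc_known:.2f}")
print(f"random set:     {acc_random:.2f}")
print(f"MI-ranked set:  {acc_ranked:.2f}")
```

Comparing a predefined set against a random set of the same size gives a rough baseline for judging whether the candidate genes carry more signal than chance in the dataset at hand.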