gSELECT: A novel pre-analysis machine-learning library enabling early hypothesis testing and predictive gene selection in single-cell data.

Author information

Caliskan Deniz, Caliskan Aylin, Dandekar Thomas, Breitenbach Tim

Affiliation

Department of Bioinformatics, Biocenter, University of Würzburg, Am Hubland, Würzburg D-97074, Germany.

Publication information

Comput Struct Biotechnol J. 2025 Aug 5;27:3510-3527. doi: 10.1016/j.csbj.2025.07.047. eCollection 2025.

DOI:10.1016/j.csbj.2025.07.047

PMID:40821713

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12354962/

Abstract

Identifying biologically meaningful gene sets and evaluating their ability to separate conditions based on gene expression is an important step in many transcriptomic analyses. While most workflows support data-driven feature selection, few allow direct evaluation of predefined gene sets in a classification context. This limits the ability to assess literature-derived panels or biologically motivated hypotheses prior to downstream analysis. To address this, we developed gSELECT, a Python library for evaluating the classification performance of both automatically ranked and user-defined gene sets. It operates on .csv or .h5ad expression matrices with group labels and can be easily integrated into existing analysis pipelines. Gene selection can be based on mutual information ranking, random sampling, or custom input. Custom input supports hypothesis-driven testing without data-derived selection bias and allows direct evaluation of known or candidate markers. Classification is performed using multilayer perceptrons with Monte Carlo cross-validation, either on the full dataset or with a user-defined train/test split. Exhaustive and greedy strategies are available to explore combinatorial effects among genes and to identify minimal gene combinations with high predictive power. gSELECT is intended as a pre-analysis tool to evaluate dataset separability and to support early assessment of candidate genes before committing to resource-intensive downstream analyses.
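
The workflow described in the abstract can be illustrated with a minimal sketch. It does not use gSELECT or reproduce its API; it assumes scikit-learn as a stand-in and a small synthetic expression matrix in place of a real .csv or .h5ad file. The names `mccv_accuracy`, `greedy_forward`, and the `gene_*` labels are hypothetical. The sketch ranks genes by mutual information, scores a ranked set, a random set, and a user-defined set with a multilayer perceptron under Monte Carlo cross-validation (repeated random stratified splits), and then runs a simple greedy forward search for a small predictive combination.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small MLPs may stop before full convergence; silence that warning for the demo.
warnings.simplefilter("ignore", ConvergenceWarning)

rng = np.random.default_rng(0)

# Hypothetical stand-in for a labelled expression matrix: 200 cells x 30 genes.
n_cells, n_genes = 200, 30
labels = np.repeat([0, 1], n_cells // 2)
expr = rng.normal(size=(n_cells, n_genes))
expr[labels == 1, :3] += 2.0  # only gene_0..gene_2 differ between the groups
gene_names = [f"gene_{i}" for i in range(n_genes)]


def mccv_accuracy(X, y, genes, n_splits=5, test_size=0.3, seed=0):
    """Mean test accuracy of an MLP over repeated random stratified splits."""
    splitter = StratifiedShuffleSplit(
        n_splits=n_splits, test_size=test_size, random_state=seed
    )
    scores = []
    for train, test in splitter.split(X, y):
        model = make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(16,), learning_rate_init=0.01,
                          max_iter=500, random_state=seed),
        )
        model.fit(X[train][:, genes], y[train])
        scores.append(model.score(X[test][:, genes], y[test]))
    return float(np.mean(scores))


def greedy_forward(X, y, candidates, max_genes=2):
    """Add the gene that most improves accuracy until no gain or max_genes."""
    selected, best = [], 0.0
    for _ in range(max_genes):
        scored = [(mccv_accuracy(X, y, selected + [g]), g)
                  for g in candidates if g not in selected]
        score, gene = max(scored)
        if score <= best:
            break
        selected.append(gene)
        best = score
    return selected, best


# Data-driven selection: rank genes by mutual information with the labels.
# Ranking on all cells before cross-validation can leak information; a stricter
# variant would rank within each training split.
mi = mutual_info_classif(expr, labels, random_state=0)
ranked = np.argsort(mi)[::-1].tolist()
top_genes = ranked[:3]

# Random baseline and a user-defined (hypothesis-driven) gene set.
random_genes = rng.choice(n_genes, size=3, replace=False).tolist()
user_genes = [gene_names.index(g) for g in ["gene_0", "gene_1"]]

acc_ranked = mccv_accuracy(expr, labels, top_genes)
acc_random = mccv_accuracy(expr, labels, random_genes)
acc_user = mccv_accuracy(expr, labels, user_genes)
greedy_genes, acc_greedy = greedy_forward(expr, labels, ranked[:4])

print("MI-ranked:", [gene_names[g] for g in top_genes], round(acc_ranked, 3))
print("random:   ", [gene_names[g] for g in random_genes], round(acc_random, 3))
print("user set: ", [gene_names[g] for g in user_genes], round(acc_user, 3))
print("greedy:   ", [gene_names[g] for g in greedy_genes], round(acc_greedy, 3))
```

Comparing the ranked and user-defined sets against the random baseline gives a quick sense of whether a candidate panel separates the groups better than chance selection would. An exhaustive search would instead score every combination of a fixed size, which is only feasible for small candidate pools.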

Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/eaa0/12354962/8cab60912676/ga1.jpg
