Departamento de Bioquímica, Universidad Autónoma de Madrid (UAM), 28029, Madrid, Spain.
Instituto de Investigaciones Biomédicas "Alberto Sols" (CSIC-UAM), 28029, Madrid, Spain.
BMC Bioinformatics. 2022 May 31;23(1):204. doi: 10.1186/s12859-022-04741-8.
Molecular gene signatures are useful tools to characterize the physiological state of cell populations, but most have been developed under a narrow range of conditions and cell types and are often restricted to a set of gene identities. Focusing on the transcriptional response to hypoxia, we aimed to generate widely applicable classifiers sourced from the results of a meta-analysis of 69 differential expression datasets, which included 425 individual RNA-seq experiments from 33 different human cell types exposed to different degrees of hypoxia (0.1-5% O2) for 2-48 h. The resulting decision trees include both gene identities and quantitative boundaries, allowing for easy classification of individual samples without a control or normoxic reference. Each tree is composed of 3-5 genes, mostly drawn from a small set of just 8 genes (EGLN1, MIR210HG, NDRG1, ANKRD37, TCAF2, PFKFB3, BHLHE40, and MAFF). In spite of their simplicity, these classifiers achieve over 95% accuracy in cross-validation and over 80% accuracy when applied to additional challenging datasets. Our results indicate that the classifiers are able to identify hypoxic tumor samples from bulk RNA-seq and hypoxic regions within tumors from spatially resolved transcriptomics datasets. Moreover, application of the classifiers to histological sections from normal tissues suggests the presence of a hypoxic gene expression pattern in the kidney cortex that is not observed in other normoxic organs. Finally, the tree classifiers described herein outperform traditional hypoxic gene signatures when compared against a wide range of datasets. This work describes a set of hypoxic gene signatures, structured as simple decision trees, that identify hypoxic samples and regions with high accuracy and can be applied to a broad variety of gene expression datasets and formats.
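To illustrate how a decision tree that combines gene identities with quantitative boundaries can label a single sample without a normoxic reference, here is a minimal Python sketch. The gene names come from the abstract, but the tree shape, the threshold values, the expression units, and the names `Node`, `Leaf`, `classify`, and `EXAMPLE_TREE` are hypothetical placeholders chosen for illustration; they are not the published trees or cutoffs.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    """Terminal node carrying the predicted class label."""
    label: str


@dataclass
class Node:
    """Internal node: compare one gene's expression to a fixed boundary."""
    gene: str
    threshold: float
    low: Union["Node", Leaf]   # branch taken when expression <= threshold
    high: Union["Node", Leaf]  # branch taken when expression > threshold


def classify(tree: Union[Node, Leaf], expression: Dict[str, float]) -> str:
    """Walk the tree for one sample and return its label.

    `expression` maps gene symbols to expression values for a single
    sample. No control sample is needed because every split uses an
    absolute boundary rather than a fold change.
    """
    node = tree
    while isinstance(node, Node):
        value = expression[node.gene]
        node = node.high if value > node.threshold else node.low
    return node.label


# HYPOTHETICAL tree: three of the eight genes named in the abstract,
# with made-up thresholds in arbitrary normalized expression units.
EXAMPLE_TREE = Node(
    gene="EGLN1",
    threshold=5.0,
    low=Node("NDRG1", 6.0, low=Leaf("normoxic"), high=Leaf("hypoxic")),
    high=Node("MIR210HG", 2.0, low=Leaf("normoxic"), high=Leaf("hypoxic")),
)

if __name__ == "__main__":
    sample = {"EGLN1": 6.0, "MIR210HG": 3.0, "NDRG1": 1.0}
    print(classify(EXAMPLE_TREE, sample))
```

Because each split tests an absolute expression value, the same function could in principle be applied per sample in bulk RNA-seq or per spot in a spatial transcriptomics dataset, provided the input values are on the same scale as the one used to derive the boundaries.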