Department of Pharmacology and Systems Therapeutics, Systems Biology Center New York (SBCNY), Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
BMC Bioinformatics. 2014 Mar 21;15:79. doi: 10.1186/1471-2105-15-79.
Identifying differentially expressed genes (DEG) is a fundamental step in studies that perform genome-wide expression profiling. Typically, DEG are identified by univariate approaches: Significance Analysis of Microarrays (SAM) or Linear Models for Microarray Data (LIMMA) for cDNA microarrays, and differential gene expression analysis based on the negative binomial distribution (DESeq) or Empirical analysis of Digital Gene Expression data in R (edgeR) for RNA-seq profiling.
Here we present a new geometrical multivariate approach to identifying DEG, called the Characteristic Direction. We demonstrate that the Characteristic Direction method is significantly more sensitive than existing methods for identifying DEG in the context of transcription factor (TF) and drug perturbation responses over a large number of microarray experiments. We also benchmarked the Characteristic Direction method using synthetic data as well as RNA-Seq data. The benchmarks on real data used a large collection of microarray expression data from TF perturbations (73 experiments) and drug perturbations (130 experiments) extracted from the Gene Expression Omnibus (GEO), together with an RNA-Seq study that profiled genome-wide gene expression and STAT3 DNA binding in two subtypes of diffuse large B-cell lymphoma. ChIP-Seq data identifying DNA binding sites of the perturbed TFs, and known drug targets of the perturbing drugs, served as a prior-knowledge silver standard for validation. In all cases the Characteristic Direction DEG calling method outperformed the other methods. We find that when drugs are applied to cells in various contexts, the proteins that interact with the drug targets are differentially expressed, and more of the corresponding genes are discovered by the Characteristic Direction method. In addition, we show that the Characteristic Direction conceptualization can be used to perform improved gene set enrichment analyses when compared with gene set enrichment analysis (GSEA) and the hypergeometric test.
The application of the Characteristic Direction method may shed new light on relevant biological mechanisms that would have remained undiscovered by the current state-of-the-art DEG methods. The method is freely accessible via various open source code implementations using four popular programming languages: R, Python, MATLAB and Mathematica, all available at: http://www.maayanlab.net/CD.
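For readers unfamiliar with the hypergeometric test named above as one of the enrichment baselines, the following is a minimal sketch of that baseline only, not of the Characteristic Direction method and not of the authors' implementation. The function name, argument names, and the numbers in the usage lines are hypothetical and chosen purely for illustration.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_universe, n_gene_set, n_deg, n_overlap):
    """Upper-tail hypergeometric p-value P(X >= n_overlap).

    Models drawing n_deg genes without replacement from a universe of
    n_universe genes, of which n_gene_set belong to the gene set of interest.
    All argument names are illustrative assumptions.
    """
    max_overlap = min(n_gene_set, n_deg)
    total = comb(n_universe, n_deg)
    # Sum the probability mass for every overlap at least as large as observed.
    # math.comb returns 0 when k > n, so impossible overlaps contribute nothing.
    tail = sum(
        comb(n_gene_set, k) * comb(n_universe - n_gene_set, n_deg - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 genes measured, a 100-gene set,
# 500 genes called as DEG, 10 of which fall in the gene set.
p_value = hypergeom_enrichment_pvalue(20000, 100, 500, 10)
print(f"enrichment p-value: {p_value:.3g}")
```

Note that this test treats the DEG list as an unordered set, so it depends on where the DEG cutoff is drawn and ignores the magnitude of each gene's change.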