Law Simon R, Kellgren Therese G, Björk Rafael, Ryden Patrik, Keech Olivier
Department of Plant Physiology, Umeå Plant Science Centre, Umeå Universitet, Umeå, Sweden.
Department of Mathematics and Mathematical Statistics, Umeå Universitet, Umeå, Sweden.
Front Plant Sci. 2020 Jun 4;11:524. doi: 10.3389/fpls.2020.00524. eCollection 2020.
Gene co-expression networks (GCNs) can be constructed using a variety of mathematical approaches from data sampled across diverse developmental processes, tissue types, pathologies, mutant backgrounds, and stress conditions. These networks are used to identify genes with similar expression dynamics but are prone to false-positive and false-negative relationships, especially with large and heterogeneous datasets. To optimize the relevance of edges in GCNs and enhance global biological insight, we propose a novel approach involving a data-centering step performed simultaneously per gene and per sub-experiment, called centralization within sub-experiments (CSE). Using a gene set encoding the plant mitochondrial proteome as a case study, we show that all CSE-based GCNs assessed had significantly more edges within the majority of the considered functional sub-networks, such as the mitochondrial electron transport chain and its complexes, than GCNs not using CSE. This demonstrates that CSE-based GCNs are efficient at predicting canonical functions and associated pathways, here referred to as the core gene network. Furthermore, we show that correlation analyses using CSE-processed data can be used to fine-tune predictions of the function of uncharacterized genes, while combining them with analyses based on non-CSE data can augment conventional stress analyses with the innate connections underpinning the dynamic system being examined. CSE is therefore an effective alternative to conventional batch correction approaches, particularly when dealing with large and heterogeneous datasets. The method is easy to integrate into a pre-existing GCN analysis pipeline and can give conventional GCNs enhanced biological relevance by allowing users to delineate a core gene network.
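The centering step described above can be illustrated with a minimal Python sketch. It rests on several assumptions not stated in the abstract: expression data are held in a genes x samples matrix, each sample carries a sub-experiment label, edges are called from Pearson correlation, and the threshold of 0.8 is arbitrary. The function names `centralize_within_subexperiments` and `correlation_edges` and the toy data are hypothetical and are not taken from the authors' pipeline.

```python
import numpy as np


def centralize_within_subexperiments(expr, sub_ids):
    """Subtract, for each gene, its mean within each sub-experiment.

    expr    : array of shape (n_genes, n_samples)
    sub_ids : sequence of length n_samples with sub-experiment labels
    """
    expr = np.asarray(expr, dtype=float)
    sub_ids = np.asarray(sub_ids)
    centered = np.empty_like(expr)
    for sub in np.unique(sub_ids):
        cols = sub_ids == sub
        # Per-gene mean over the samples of this sub-experiment only.
        centered[:, cols] = expr[:, cols] - expr[:, cols].mean(axis=1, keepdims=True)
    return centered


def correlation_edges(expr, threshold=0.8):
    """Return gene index pairs whose absolute Pearson correlation
    meets the (assumed, arbitrary) threshold."""
    corr = np.corrcoef(expr)
    i, j = np.triu_indices(corr.shape[0], k=1)
    keep = np.abs(corr[i, j]) >= threshold
    return list(zip(i[keep].tolist(), j[keep].tolist()))


# Toy data: two sub-experiments ("ctrl", "stress") with four samples each.
sub_ids = ["ctrl"] * 4 + ["stress"] * 4
expr = np.array([
    [1, 2, 3, 4, 11, 12, 13, 14],   # gene 0: responds to stress
    [2, 4, 6, 8, 2, 4, 6, 8],       # gene 1: tracks gene 0 within each sub-experiment
    [0, 1, 1, 0, 10, 11, 11, 10],   # gene 2: shares only the stress shift with gene 0
])

raw_edges = correlation_edges(expr)
cse_edges = correlation_edges(centralize_within_subexperiments(expr, sub_ids))

print(raw_edges)  # [(0, 2)]
print(cse_edges)  # [(0, 1)]
```

In this toy case the uncentered data link genes 0 and 2 only because both shift under stress, while the centered data link genes 0 and 1, which co-vary within each sub-experiment. This is meant only to illustrate the kind of false-positive and false-negative edges the abstract refers to, not to reproduce the reported results.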
Gene co-expression networks (GCNs) are the product of a variety of mathematical approaches that identify causal relationships in gene expression dynamics but are prone to false positives and false negatives, especially with large and heterogeneous datasets. Given the burgeoning output of next-generation sequencing projects performed on a variety of species and under diverse developmental or clinical conditions, the statistical power and complexity of these networks will undoubtedly increase, while their biological relevance will be fiercely challenged. Here, we propose a novel approach to generate a "core" GCN with enhanced biological relevance. Our method involves a data-centering step that effectively removes all primary treatment/tissue effects; it is simple to employ and can be easily integrated into pre-existing GCN analysis pipelines. The gain in biological relevance resulting from this approach was assessed using a plant mitochondrial case study.
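Why a centering step of this kind removes primary treatment/tissue effects can be seen under a simple additive model. The model and all symbols below are illustrative assumptions, not notation from the source: $x_{g,s}$ is the expression of gene $g$ in sample $s$, $e(s)$ is the sub-experiment containing $s$, $S_e$ is the set of samples in sub-experiment $e$, $\mu_g$ is a gene baseline, $\tau_{g,e}$ is the treatment/tissue effect, and $\varepsilon_{g,s}$ is the remaining within-sub-experiment variation.

```latex
\begin{align}
x_{g,s} &= \mu_g + \tau_{g,e(s)} + \varepsilon_{g,s} \\
\tilde{x}_{g,s} &= x_{g,s} - \frac{1}{|S_{e(s)}|} \sum_{t \in S_{e(s)}} x_{g,t} \\
&= \varepsilon_{g,s} - \frac{1}{|S_{e(s)}|} \sum_{t \in S_{e(s)}} \varepsilon_{g,t}
\end{align}
```

Both $\mu_g$ and $\tau_{g,e(s)}$ are constant across the samples of a sub-experiment, so they cancel against their own mean. Correlations computed on $\tilde{x}$ then reflect only co-variation within sub-experiments, provided the additive assumption holds.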