Feltus F Alex, Ficklin Stephen P, Gibson Scott M, Smith Melissa C
Department of Genetics & Biochemistry, Clemson University, 105 Collings Street, Clemson, SC 29634, USA.
BMC Syst Biol. 2013 Jun 5;7:44. doi: 10.1186/1752-0509-7-44.
In genomics, highly relevant gene interaction (co-expression) networks have been constructed by finding significant pair-wise correlations between genes in expression datasets. These networks are then mined to elucidate biological function at the polygenic level. In some cases, networks are constructed from input samples that measure gene expression under a variety of conditions, such as different genotypes, environments, disease states and tissues. When large sets of samples are obtained from public repositories, it is often impractical to assign samples to condition-specific groups, and combining samples from various conditions reduces the size of the resulting network. A fixed significance threshold is also often applied, which further limits the size of the final network. Therefore, we propose pre-clustering of input expression samples to approximate condition-specific grouping of samples, followed by construction of a separate network for each group, as a means of dynamic significance thresholding. The net effect is increased sensitivity, which maximizes the total number of co-expression relationships in the final co-expression network compendium.
A total of 86 Arabidopsis thaliana co-expression networks were constructed after k-means partitioning of 7,105 publicly available ATH1 Affymetrix microarray samples. We term each pre-sorted network a Gene Interaction Layer (GIL). Random Matrix Theory (RMT), an unsupervised thresholding method, was used to threshold each of the 86 networks independently, effectively providing a dynamic (non-global) threshold. Across all GILs, the compendium contained 19,588 genes (94.7% of measured genes) and 558,022 unique co-expression relationships. In comparison, network construction without pre-sorting of input samples yielded a global network of only 3,297 genes (15.9%) and 129,134 relationships.
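The partition-then-threshold idea described above can be illustrated with a minimal sketch. This is not the authors' pipeline: it uses only the Python standard library, a deterministic farthest-point initialization for k-means, and a fixed correlation cutoff as a stand-in for the RMT-derived, per-cluster threshold. All function names (`pearson`, `kmeans_labels`, `edges`, `layered_edges`), the gene names, and the toy data are hypothetical. In the toy data, two genes are positively correlated in one condition and negatively correlated in the other, so the pooled correlation is zero and the pair is only recovered after pre-clustering.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def kmeans_labels(points, k, iters=50):
    """Plain k-means with farthest-point initialization (a simplification)."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(d2(p, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda j: d2(p, centroids[j])) for p in points]
        if new == labels:  # assignments stable: converged
            break
        labels = new
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def edges(expr, cols, threshold):
    """Gene pairs whose |r| over the given sample columns meets the cutoff."""
    cols = list(cols)
    found = set()
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson([expr[g1][c] for c in cols], [expr[g2][c] for c in cols])
        if abs(r) >= threshold:
            found.add((g1, g2))
    return found


def layered_edges(expr, k, threshold):
    """Cluster samples, build one network per cluster, return the edge union.

    ASSUMPTION: a fixed cutoff stands in for a per-cluster RMT threshold.
    """
    genes = sorted(expr)
    n = len(expr[genes[0]])
    samples = [[expr[g][c] for g in genes] for c in range(n)]
    labels = kmeans_labels(samples, k)
    union = set()
    for j in range(k):
        cols = [c for c in range(n) if labels[c] == j]
        if len(cols) >= 3:  # too few samples gives meaningless correlations
            union |= edges(expr, cols, threshold)
    return union


# Toy expression matrix (gene -> values across 8 samples, two conditions).
expr = {
    "g1": [1, 2, 3, 4, 1, 2, 3, 4],
    "g2": [1, 2, 3, 4, 4, 3, 2, 1],
    "marker": [0, 1, 0, 1, 100, 101, 100, 101],
}
pooled = edges(expr, range(8), 0.9)     # single global network
layered = layered_edges(expr, 2, 0.9)   # union over per-cluster networks
print(len(pooled), len(layered))
```

On this toy input the pooled network is empty while the layered union contains the g1/g2 pair, which mirrors in miniature the sensitivity gain reported for the GIL compendium. A real analysis would replace the fixed cutoff with a threshold chosen separately for each cluster.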
Here we show that pre-clustering of microarray samples helps approximate condition-specific networks and allows for dynamic thresholding using unsupervised methods. Because RMT ensures that only highly significant interactions are kept, the GIL compendium consists of 558,022 unique, high-quality A. thaliana co-expression relationships across almost all of the measurable genes on the ATH1 array. For A. thaliana, these networks represent the largest compendium to date of significant gene co-expression relationships, and they provide a means to explore complex pathway, polygenic, and pleiotropic relationships in this focal model plant. The networks can be explored at sysbio.genome.clemson.edu. Finally, this method is applicable to any large expression profile collection for any organism and is best suited to cases where a knowledge-independent network construction method is desired.