Genome Center, University of California Davis, 451 Health Science Dr, Davis, CA, 95616, United States of America.
BMC Bioinformatics. 2012 Sep 28;13:253. doi: 10.1186/1471-2105-13-253.
In previous studies, gene neighborhoods (spatial clusters of co-expressed genes in the genome) have been defined using arbitrary rules such as requiring adjacency, a minimum number of genes, a fixed window size, or a minimum expression level. In the current study, we developed a Gene Neighborhood Scoring Tool (G-NEST), which combines genomic location, gene expression, and evolutionary sequence conservation data to score putative gene neighborhoods across all possible window sizes simultaneously.
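The idea of scoring every window size, rather than fixing one in advance, can be illustrated with a minimal sketch. This is not G-NEST's actual scoring scheme: it assumes genes on one chromosome are already ordered by position, uses mean pairwise Pearson correlation of expression profiles as a stand-in score, and ignores the evolutionary conservation component entirely. The function names `pearson` and `score_windows` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def score_windows(profiles, min_size=2, max_size=None):
    """Score every window of consecutive genes (illustrative only).

    `profiles` is a list of per-gene expression vectors, ordered by
    genomic position on a single chromosome. Returns a dict mapping
    (start_index, window_size) to the mean pairwise correlation of the
    genes in that window. This stand-in score is an assumption, not the
    statistic used by G-NEST.
    """
    n = len(profiles)
    max_size = n if max_size is None else min(max_size, n)
    # Compute each pairwise correlation once and reuse it across windows.
    corr = {
        (i, j): pearson(profiles[i], profiles[j])
        for i, j in combinations(range(n), 2)
    }
    scores = {}
    for size in range(min_size, max_size + 1):
        for start in range(n - size + 1):
            pairs = list(combinations(range(start, start + size), 2))
            scores[(start, size)] = sum(corr[p] for p in pairs) / len(pairs)
    return scores
```

Calling `score_windows` on a list of expression vectors returns one score per (start, size) pair, so candidate neighborhoods of different sizes can be ranked against each other instead of being filtered by a single arbitrary window.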
Using G-NEST on atlases of mouse and human tissue expression data, we found that large neighborhoods of ten or more genes are extremely rare in mammalian genomes. When they do occur, neighborhoods are typically composed of families of related genes. Both the highest scoring and the largest neighborhoods in mammalian genomes are formed by tandem gene duplication. Mammalian gene neighborhoods contain highly and variably expressed genes. Co-localized noisy gene pairs exhibit lower evolutionary conservation of their adjacent genome locations, suggesting that their shared transcriptional background may be disadvantageous. Genes that are essential to mammalian survival and reproduction are less likely to occur in neighborhoods, although neighborhoods are enriched with genes that function in mitosis. We also found that gene orientation and protein-protein interactions are partially responsible for maintenance of gene neighborhoods.
Our experiments using G-NEST confirm that tandem gene duplication is the primary driver of non-random gene order in mammalian genomes. Non-essentiality, co-functionality, gene orientation, and protein-protein interactions are additional forces that maintain gene neighborhoods, especially those formed by tandem duplicates. We expect G-NEST to be useful for other applications such as the identification of core regulatory modules, common transcriptional backgrounds, and chromatin domains. The software is available at http://docpollard.org/software.html.