Pandey Ashutosh K, Lu Lu, Wang Xusheng, Homayouni Ramin, Williams Robert W
UT Center for Integrative and Translational Genomics and Department of Anatomy and Neurobiology, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America.
UT Center for Integrative and Translational Genomics and Department of Anatomy and Neurobiology, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America; St. Jude Children's Research Hospital, Memphis, Tennessee, United States of America.
PLoS One. 2014 Feb 11;9(2):e88889. doi: 10.1371/journal.pone.0088889. eCollection 2014.
What proportion of genes with intense and selective expression in specific tissues, cells, or systems are still almost completely uncharacterized with respect to biological function? In what ways do these functionally enigmatic genes differ from well-studied genes? To address these two questions, we devised a computational approach that defines so-called ignoromes. As proof of principle, we extracted and analyzed a large subset of genes with intense and selective expression in brain. We find that publications associated with this set are highly skewed: the top 5% of genes absorb 70% of the relevant literature. In contrast, approximately 20% of genes have essentially no neuroscience literature. Analysis of the ignorome over the past decade demonstrates that it is stubbornly persistent, and the rapid expansion of the neuroscience literature has not had the expected effect on the number of these genes. Surprisingly, ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks, nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, with early discovery being associated with greater research momentum, a genomic bandwagon effect. Finally, we ask to what extent massive genomic, imaging, and phenotype data sets can be used to provide high-throughput functional annotation for an entire ignorome. In a majority of cases we have been able to extract and add significant information for these neglected genes. In several cases (ELMOD1, TMEM88B, and DZANK1) we have exploited sequence polymorphisms, large phenome data sets, and reverse genetic methods to evaluate the function of ignorome genes.
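The literature skew described above (a small top fraction of genes absorbing most publications, and a sizeable fraction having none) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `top_share` and `ignorome_fraction`, the zero-publication cutoff, and the toy counts are all assumptions made for illustration, with the toy data chosen only to mirror the reported proportions.

```python
from typing import Sequence


def top_share(counts: Sequence[int], top_fraction: float = 0.05) -> float:
    """Share of all publications held by the top `top_fraction` of genes."""
    if not counts:
        raise ValueError("counts must not be empty")
    total = sum(counts)
    if total == 0:
        return 0.0
    ranked = sorted(counts, reverse=True)
    # Always include at least one gene in the "top" group.
    n_top = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / total


def ignorome_fraction(counts: Sequence[int], max_pubs: int = 0) -> float:
    """Fraction of genes with at most `max_pubs` publications (assumed cutoff)."""
    if not counts:
        raise ValueError("counts must not be empty")
    return sum(1 for c in counts if c <= max_pubs) / len(counts)


# Hypothetical per-gene publication counts for 20 genes (not real data).
toy_counts = [700] + [20] * 15 + [0] * 4

print(f"top 5% share: {top_share(toy_counts):.2f}")
print(f"fraction with no literature: {ignorome_fraction(toy_counts):.2f}")
```

In practice the per-gene counts would come from a literature database query restricted to the field of interest, and the cutoff for "essentially no literature" is a choice that would need to be justified rather than fixed at zero.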