Bioinformatics Program, University of Memphis, Memphis, Tennessee, United States of America.
PLoS One. 2011 Apr 14;6(4):e18851. doi: 10.1371/journal.pone.0018851.
High-throughput genomic technologies enable researchers to identify genes that are co-regulated with respect to specific experimental conditions. Numerous statistical approaches have been developed to identify differentially expressed genes. Because each approach can produce a distinct gene set, it is difficult for biologists to determine which statistical approach yields biologically relevant gene sets and is appropriate for their study. To address this issue, we implemented Latent Semantic Indexing (LSI) to determine the functional coherence of gene sets. An LSI model was built using over 1 million Medline abstracts for over 20,000 mouse and human genes annotated in Entrez Gene. The gene-to-gene LSI-derived similarities were used to calculate a literature cohesion p-value (LPv) for a given gene set using Fisher's exact test. We tested this method against genes in more than 6,000 functional pathways annotated in Gene Ontology (GO) and found that approximately 75% of the gene sets in the GO biological process category and 90% of the gene sets in the GO molecular function and cellular component categories were functionally cohesive (LPv<0.05). These results indicate that the LPv methodology is both robust and accurate. Application of this method to previously published microarray datasets demonstrated that LPv can be helpful in selecting appropriate feature extraction methods. To enable real-time calculation of LPv for mouse or human gene sets, we developed a web tool called Gene-set Cohesion Analysis Tool (GCAT). GCAT can complement other gene set enrichment approaches by determining the overall functional cohesion of data sets, taking into account both explicit and implicit gene interactions reported in the biomedical literature.
GCAT is freely available at http://binf1.memphis.edu/gcat.
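The abstract does not spell out how the Fisher's exact test is set up, so the following is only a minimal sketch of one plausible construction, not the authors' implementation. It assumes a 2x2 table that compares gene pairs inside the set with all remaining pairs, split by whether the pairwise similarity reaches a cutoff. The function names, the `threshold` parameter, and the toy similarity values are all hypothetical.

```python
from itertools import combinations
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b          # pairs inside the gene set
    col1 = a + c          # pairs with high similarity
    n = a + b + c + d     # all pairs
    upper = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1)
    )
    return tail / comb(n, row1)


def literature_cohesion_pvalue(gene_set, similarity, all_genes, threshold):
    """Hypothetical LPv-style score (an assumption, not the published method).

    Tests whether pairs inside gene_set are enriched for similarity >= threshold
    relative to every other pair drawn from all_genes.
    """
    members = set(gene_set)
    in_high = in_low = out_high = out_low = 0
    for g1, g2 in combinations(sorted(all_genes), 2):
        high = similarity[frozenset((g1, g2))] >= threshold
        inside = g1 in members and g2 in members
        if inside and high:
            in_high += 1
        elif inside:
            in_low += 1
        elif high:
            out_high += 1
        else:
            out_low += 1
    return fisher_exact_greater(in_high, in_low, out_high, out_low)


# Toy data: six made-up genes; A, B and C are mutually similar, all else is not.
genes = ["A", "B", "C", "D", "E", "F"]
related = {"A", "B", "C"}
similarity = {
    frozenset(pair): 0.9 if set(pair) <= related else 0.1
    for pair in combinations(genes, 2)
}

p = literature_cohesion_pvalue(["A", "B", "C"], similarity, genes, threshold=0.5)
print(f"toy LPv-style p-value: {p:.4f}")
```

In this toy case all three within-set pairs are highly similar and none of the other twelve pairs are, so the set would be called cohesive at the LPv<0.05 level used in the abstract. In practice the similarities would come from the LSI model rather than a hand-built dictionary.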