Burkart Mark F, Wren Jonathan D, Herschkowitz Jason I, Perou Charles M, Garner Harold R
Department of Internal Medicine, The McDermott Center for Human Growth and Development, Division of Translational Research, The University of Texas Southwestern Medical Center, Dallas, Texas 75390, USA.
Bioinformatics. 2007 Aug 1;23(15):1995-2003. doi: 10.1093/bioinformatics/btm261. Epub 2007 May 30.
Microarrays rapidly generate large quantities of gene expression information, but interpreting such data within a biological context is still relatively complex and laborious. New methods that can identify functionally related genes via shared literature concepts will be useful in addressing these needs.
We have developed a novel method that uses implicit literature relationships (concepts related via shared, intermediate concepts) to cluster related genes. Genes are evaluated for implicit connections within a network of biomedical objects (other genes, ontological concepts and diseases) that are connected via their co-occurrences in Medline titles and/or abstracts. On the basis of these implicit relationships, individual gene pairs are scored using a probability-based algorithm. Scores are generated for all pairwise combinations of genes, and the genes are then clustered based on these scores. We applied this method to a test set composed of nine functional groups with known relationships. The method scored highly for all nine groups and performed significantly better than a benchmark co-occurrence-based method for six of them. We then applied this method to gene sets specific to two previously defined breast tumor subtypes. Analysis of the results recapitulated known biological relationships and identified novel pathway relationships unique to each tumor subtype. We demonstrate that this method provides a valuable new means of identifying and visualizing significantly related genes within gene lists via their implicit relationships in the literature.
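The pipeline described above (co-occurrence network, implicit pairwise scoring, clustering) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not specify the probability-based score, so the sketch assumes a simple observed/expected ratio of shared intermediate concepts, and it swaps in a plain threshold-based single-linkage grouping for the clustering step. All gene and concept names, the function names, and the threshold are hypothetical.

```python
from itertools import combinations


def build_network(cooccurrences):
    """Build an undirected network from (object, object) co-occurrence pairs."""
    network = {}
    for x, y in cooccurrences:
        network.setdefault(x, set()).add(y)
        network.setdefault(y, set()).add(x)
    return network


def shared_intermediates(network, a, b):
    """Objects that co-occur with both a and b (the implicit connections)."""
    return (network.get(a, set()) & network.get(b, set())) - {a, b}


def implicit_score(network, a, b):
    """ASSUMPTION: observed shared intermediates divided by the number expected
    if neighbours were drawn independently; a stand-in for the paper's
    unspecified probability-based score."""
    deg_a = len(network.get(a, set()))
    deg_b = len(network.get(b, set()))
    if deg_a == 0 or deg_b == 0:
        return 0.0
    expected = deg_a * deg_b / len(network)
    return len(shared_intermediates(network, a, b)) / expected


def score_all_pairs(network, genes):
    """Score every pairwise combination of genes."""
    return {(a, b): implicit_score(network, a, b)
            for a, b in combinations(sorted(genes), 2)}


def cluster(genes, scores, threshold):
    """ASSUMPTION: single-linkage grouping; genes whose pair score reaches
    the threshold end up in the same cluster (union-find)."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for g in genes:
        groups.setdefault(find(g), set()).add(g)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical toy data: no two genes co-occur directly, only via concepts.
TOY_COOCCURRENCES = [
    ("geneA", "apoptosis"), ("geneB", "apoptosis"),
    ("geneA", "p53 pathway"), ("geneB", "p53 pathway"),
    ("geneC", "glycolysis"), ("geneD", "glycolysis"),
    ("geneC", "hypoxia"), ("geneD", "hypoxia"),
]
TOY_GENES = ["geneA", "geneB", "geneC", "geneD"]

if __name__ == "__main__":
    net = build_network(TOY_COOCCURRENCES)
    pair_scores = score_all_pairs(net, TOY_GENES)
    print(cluster(TOY_GENES, pair_scores, threshold=1.0))
```

In the toy data, geneA and geneB never co-occur directly, yet they are grouped because they share two intermediate concepts. A direct co-occurrence benchmark of the kind mentioned in the abstract would not link them on this input.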