Leong Hui Sun, Kipling David
Department of Pathology, School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, UK.
Nucleic Acids Res. 2009 Jun;37(11):e79. doi: 10.1093/nar/gkp310. Epub 2009 May 8.
A major challenge in microarray data analysis is the functional interpretation of gene lists. A common approach is over-representation analysis (ORA), which uses the hypergeometric test (or its variants) to evaluate whether a particular functionally defined group of genes is represented more than expected by chance within a gene list. Existing applications of ORA have been largely limited to pre-defined terminologies such as GO and KEGG. We report our explorations of whether ORA can be applied to a wider mining of free text. We found that a hitherto underappreciated feature of experimentally derived gene lists is that their constituents have substantially more annotation associated with them, because they have been studied for a longer period of time. This bias, a result of patterns of research activity within the biomedical community, is a major problem for classical hypergeometric test-based ORA approaches, which cannot account for it. We have therefore developed three approaches to overcome this bias, and demonstrate their usability in a wide range of published datasets covering different species. A comparison with existing tools that use GO terms suggests that mining PubMed abstracts can reveal additional biological insight that may not be possible by mining pre-defined ontologies alone.
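For readers unfamiliar with the classical test mentioned above, the following is a minimal sketch of hypergeometric ORA in Python. It is not one of the three bias-corrected approaches the abstract refers to, which are not described here; it only illustrates the uncorrected baseline. The function name `ora_p_value` and its parameter names are assumptions made for this illustration.

```python
from math import comb


def ora_p_value(population: int, annotated: int, list_size: int, hits: int) -> float:
    """Upper-tail hypergeometric probability P(X >= hits).

    population: number of genes in the background set (hypothetical name)
    annotated:  background genes carrying the term of interest
    list_size:  number of genes in the experimentally derived list
    hits:       genes in the list that carry the term
    """
    max_hits = min(annotated, list_size)
    # Number of ways to draw a list of this size from the background.
    total = comb(population, list_size)
    # Sum the probabilities of observing `hits` or more annotated genes.
    tail = sum(
        comb(annotated, i) * comb(population - annotated, list_size - i)
        for i in range(hits, max_hits + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    p = ora_p_value(population=20000, annotated=200, list_size=100, hits=5)
    print(f"p = {p:.3g}")
```

Note that this test treats every gene in the background as equally likely to carry a term. That is the assumption the abstract identifies as problematic when genes in the list are more heavily annotated than the background.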