Computational Biology Group, Department of Clinical Laboratory Sciences, University of Cape Town, Cape Town, South Africa.
BMC Bioinformatics. 2011 Jan 24;12:29. doi: 10.1186/1471-2105-12-29.
To interpret the results of a microarray experiment, researchers often shift focus from the analysis of individual differentially expressed genes to the analysis of sets of genes. These gene-set analysis (GSA) methods use previously accumulated biological knowledge to group genes into sets and then aim to rank these gene sets in a way that reflects their relative importance in the experimental situation in question. We suspect that the presence of paralogs affects the ability of GSA methods to accurately identify the most important sets of genes for subsequent research.
We show that paralogs, which typically have high sequence identity and similar molecular functions, also exhibit high correlation in their expression patterns. We investigate this correlation as a potential confounding factor common to current GSA methods. To do so we use Indygene (http://www.cbio.uct.ac.za/indygene), a web tool that reduces a supplied list of genes so that it contains no pairwise paralogy relationships above a specified sequence similarity threshold. We use the tool to reanalyse previously published microarray datasets and assess the potential utility of accounting for the presence of paralogs.
The Indygene tool efficiently removes paralogy relationships from a given dataset. We found that such a reduction, performed prior to GSA, can generate significantly different results that often represent novel and plausible biological hypotheses. This was demonstrated for three different GSA approaches applied to the reanalysis of previously published microarray datasets, and it suggests that the redundancy and non-independence of paralogs are important considerations when applying GSA methodologies.
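The kind of reduction described above, a gene list filtered so that no two remaining genes are paralogs above a chosen similarity threshold, can be illustrated with a minimal sketch. This is not Indygene's actual algorithm, which the abstract does not specify. It assumes a simple greedy pass in input order, and the function name `reduce_paralogs`, the gene labels, and the similarity values are all hypothetical, chosen only for illustration.

```python
from typing import Dict, FrozenSet, List


def reduce_paralogs(
    genes: List[str],
    similarity: Dict[FrozenSet[str], float],
    threshold: float,
) -> List[str]:
    """Greedy sketch (an assumption, not Indygene's documented method).

    Walk the list in order and keep a gene only if its similarity to every
    gene already kept is at or below `threshold`. Pairs missing from
    `similarity` are treated as non-paralogous (similarity 0.0).
    """
    kept: List[str] = []
    for gene in genes:
        if all(
            similarity.get(frozenset((gene, other)), 0.0) <= threshold
            for other in kept
        ):
            kept.append(gene)
    return kept


# Toy input: labels and similarity values are made up for illustration.
genes = ["A1", "A2", "B1", "C1"]
similarity = {
    frozenset(("A1", "A2")): 0.95,  # close paralog pair
    frozenset(("A1", "B1")): 0.45,  # distant paralog pair
    frozenset(("A2", "B1")): 0.44,
}

strict = reduce_paralogs(genes, similarity, threshold=0.9)
loose = reduce_paralogs(genes, similarity, threshold=0.4)
```

With these toy values, a threshold of 0.9 removes only `A2`, while a threshold of 0.4 also removes `B1`. A greedy pass depends on input order, so which member of a paralog pair survives here is an artifact of the sketch rather than a biological choice.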