Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany.
Bioinformatics. 2009 Nov 1;25(21):2772-9. doi: 10.1093/bioinformatics/btp492. Epub 2009 Aug 18.
When comparing gene expression levels between species or strains using microarrays, sequence differences between the groups can cause false identification of expression differences. Our simulated dataset shows that a sequence divergence of only 1% between species can lead to falsely reported expression differences for >50% of the transcripts; similar levels of effect have been reported previously in comparisons of human and chimpanzee expression. We propose a method for identifying probes that cause such false readings, using only the microarray data, so that problematic probes can be excluded from analysis. We then test the power of the method to detect sequence differences and to correct for falsely reported expression differences. Our method can detect 70% of the probes with sequence differences using human and chimpanzee data, while removing only 18% of probes with no sequence differences. Although only 70% of the probes with sequence differences are detected, the effect of removing probes on falsely reported expression differences is more dramatic: the method can remove 98% of the falsely reported expression differences from a simulated dataset. We argue that the method should be used even when sequence data are available.
Supplementary data are available at Bioinformatics online.