Lu Yong, Rosenfeld Roni, Bar-Joseph Ziv
School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, 15213, USA.
Bioinformatics. 2006 Jul 15;22(14):e314-22. doi: 10.1093/bioinformatics/btl229.
The expression of genes during the cell division process has now been studied in many different species. An important goal of these studies is to identify the set of cycling genes. To date, this has been done independently for each species studied. Because of noise and other data analysis problems, accurately deriving a set of cycling genes from expression data is hard. This is especially true for some multicellular organisms, including humans.
Here we present the first algorithm that combines microarray expression data from multiple species to identify cycling genes. Our algorithm represents genes from multiple species as nodes in a graph, with edges between genes representing sequence similarity. Starting with the measured expression values for each species, we use Belief Propagation to determine a posterior score for each gene. This posterior is then used to determine a new set of cycling genes for each species. We applied our algorithm to improve the identification of cell cycle genes in budding yeast and humans. As we show, by incorporating sequence similarity information we obtained a more accurate set of genes than methods that rely on expression data alone. Our method was especially successful on the human dataset, indicating that it can use a high-quality dataset from one species to overcome noise problems in another.
A C implementation is available from the supporting website: http://www.cs.cmu.edu/~lyongu/pub/cellcycle/.
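The graph-based idea described in the abstract can be illustrated with a small sketch. This is not the authors' C implementation, and the abstract does not specify the potentials they use. The sketch assumes a pairwise model with two states per gene (cycling or not cycling), a per-gene prior derived from expression data, and an edge "agreement" value in (0.5, 1) standing in for sequence similarity. The function name `loopy_bp`, the gene names, and all numbers are hypothetical.

```python
from collections import defaultdict


def loopy_bp(priors, edges, n_iters=20):
    """Sum-product belief propagation on a gene similarity graph (sketch).

    priors: {gene: P(cycling) from expression alone}, each strictly in (0, 1)
    edges:  {(gene_a, gene_b): agreement}, the assumed probability that two
            sequence-similar genes share the same label, in (0.5, 1)
    Returns {gene: posterior P(cycling)}.
    """
    neighbors = defaultdict(dict)
    for (a, b), agree in edges.items():
        neighbors[a][b] = agree
        neighbors[b][a] = agree

    # Node potentials over the two states (index 0 = not cycling, 1 = cycling).
    phi = {g: (1.0 - p, p) for g, p in priors.items()}

    # messages[(src, dst)] holds src's message about dst's state, start uniform.
    messages = {(a, b): (0.5, 0.5) for a in neighbors for b in neighbors[a]}

    for _ in range(n_iters):
        new_messages = {}
        for (src, dst) in messages:
            agree = neighbors[src][dst]
            # Combine src's own evidence with messages from all neighbors but dst.
            b0, b1 = phi[src]
            for other in neighbors[src]:
                if other != dst:
                    m0, m1 = messages[(other, src)]
                    b0 *= m0
                    b1 *= m1
            # Marginalise over src's state through the edge potential.
            out0 = b0 * agree + b1 * (1.0 - agree)
            out1 = b0 * (1.0 - agree) + b1 * agree
            total = out0 + out1
            new_messages[(src, dst)] = (out0 / total, out1 / total)
        messages = new_messages

    # Posterior belief: node potential times all incoming messages, normalised.
    posteriors = {}
    for g in priors:
        b0, b1 = phi[g]
        for other in neighbors[g]:
            m0, m1 = messages[(other, g)]
            b0 *= m0
            b1 *= m1
        posteriors[g] = b1 / (b0 + b1)
    return posteriors


# Hypothetical example: a confident yeast call and an uninformative human call.
priors = {"yeast_gene": 0.9, "human_gene": 0.5}
edges = {("yeast_gene", "human_gene"): 0.8}
posteriors = loopy_bp(priors, edges)
print(round(posteriors["human_gene"], 2))  # 0.74
```

In this toy case the noisy human gene, whose expression data alone give no information, inherits evidence from its well-measured yeast neighbor, while the yeast posterior stays at its prior. On graphs with cycles the same message updates give only approximate posteriors, and the fixed iteration count used here is a simplification.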