Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
Bioinformatics. 2012 Jun 15;28(12):i258-64. doi: 10.1093/bioinformatics/bts205.
With the vast increase in the number of gene expression datasets deposited in public databases, novel techniques are required to analyze and mine this wealth of data. Just as BLAST enables cross-species comparison of sequence data, tools that enable cross-species expression comparison will allow us to better utilize these datasets: such comparison lets us address questions in evolution and development, and further allows the identification of disease-related genes and pathways that play similar roles in humans and model organisms. Unlike sequence data, which are static, expression data change over time and under different conditions. Thus, a prerequisite for performing cross-species analysis is the ability to match experiments across species.
To enable better cross-species comparisons, we developed methods for automatically identifying pairs of similar expression datasets across species. Our method uses a co-training algorithm to combine a model of expression similarity with a model of the text which accompanies the expression experiments. The co-training method outperforms previous methods based on expression similarity alone. Using expert analysis, we show that the new matches identified by our method indeed capture biological similarities across species. We then use the matched expression pairs between human and mouse to recover known and novel cycling genes as well as to identify genes with possible involvement in diabetes. By providing the ability to identify novel candidate genes in model organisms, our method opens the door to new models for studying diseases.
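The general co-training idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid scorer is a toy stand-in for the expression and text models, and all names (`CentroidClassifier`, `co_train`, the `expr` and `text` feature vectors) are hypothetical. Each candidate dataset pair is assumed to have two feature views, one derived from expression similarity and one from the accompanying text, and a small seed set is assumed to contain both matching (1) and non-matching (0) pairs.

```python
from statistics import fmean


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [fmean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CentroidClassifier:
    """Toy per-view model (an assumption, not the paper's model)."""

    def fit(self, X, y):
        self.pos = centroid([x for x, label in zip(X, y) if label == 1])
        self.neg = centroid([x for x, label in zip(X, y) if label == 0])
        return self

    def score(self, x):
        # Positive means closer to the "match" centroid; the magnitude
        # is used as a crude confidence.
        return sq_dist(x, self.neg) - sq_dist(x, self.pos)


def co_train(expr, text, seed_labels, rounds=5, per_round=1):
    """Grow the labeled set by letting each view label its most
    confident unlabeled pairs, which then train both views."""
    labels = dict(seed_labels)
    n = len(expr)
    for _ in range(rounds):
        idx = list(labels)
        y = [labels[i] for i in idx]
        new = {}
        for view in (expr, text):
            model = CentroidClassifier().fit([view[i] for i in idx], y)
            pool = [i for i in range(n) if i not in labels and i not in new]
            ranked = sorted(pool, key=lambda i: abs(model.score(view[i])),
                            reverse=True)
            for i in ranked[:per_round]:
                new[i] = 1 if model.score(view[i]) > 0 else 0
        if not new:  # nothing left to label
            break
        labels.update(new)
    return labels


# Made-up toy data: six candidate pairs, two of them labeled as seeds.
expr = [[1.0, 1.0], [0.0, 0.0], [0.9, 0.8], [0.1, 0.2], [0.8, 1.0], [0.2, 0.0]]
text = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.2, 0.9], [0.8, 0.1], [0.1, 0.8]]
seed = {0: 1, 1: 0}

result = co_train(expr, text, seed)
print(result)
```

The point of the two-view design is that a pair one view labels confidently becomes training data for the other view, so the text model can help where expression similarity alone is ambiguous, and vice versa.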
Source code and supplementary information are available at: www.andrew.cmu.edu/user/aaronwis/cotrain12.