Matthew A. Hibbs, David C. Hess, Chad L. Myers, Curtis Huttenhower, Kai Li, Olga G. Troyanskaya
Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA.
Bioinformatics. 2007 Oct 15;23(20):2692-9. doi: 10.1093/bioinformatics/btm403. Epub 2007 Aug 27.
The increasing availability of gene expression microarray technology has resulted in the publication of thousands of microarray gene expression datasets investigating various biological conditions. This vast repository is still underutilized due to the lack of methods for fast, accurate exploration of the entire compendium.
We collected Saccharomyces cerevisiae gene expression microarray data covering roughly 2400 experimental conditions, analyzed the functional coverage of this collection, and designed a context-sensitive search algorithm for rapid exploration of the compendium. A researcher using our system provides a small set of query genes to establish a biological search context. Based on this query, we weight each dataset by its relevance to the context, and within these weighted datasets we identify additional genes that are co-expressed with the query set. When recapitulating known biology, our method shows an average increase in accuracy of 273% compared to previous mega-clustering approaches. Further, our search paradigm identifies novel biological predictions that can be verified through further experimentation. Our methodology lets biological researchers explore the totality of existing microarray data in a manner useful for drawing conclusions and formulating hypotheses, which we believe is invaluable for the research community.
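The query-driven search described above can be illustrated with a minimal Python sketch. The abstract does not specify the weighting or scoring formulas, so everything concrete here is an assumption made for illustration: each dataset is a hypothetical dict mapping gene names to expression profiles, a dataset's relevance is taken to be the mean pairwise Pearson correlation among the query genes (floored at zero), and each remaining gene is scored by its weighted mean correlation to the query genes. The function names `pearson`, `dataset_weight`, and `search` are likewise hypothetical and are not part of SPELL.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def dataset_weight(dataset, query):
    """Assumed relevance: mean pairwise correlation among query genes, floored at 0."""
    present = [g for g in query if g in dataset]
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    mean_r = sum(pearson(dataset[a], dataset[b]) for a, b in pairs) / len(pairs)
    return max(mean_r, 0.0)


def search(datasets, query):
    """Rank non-query genes by weighted mean correlation to the query genes."""
    totals, weight_sums = {}, {}
    for dataset in datasets:
        w = dataset_weight(dataset, query)
        if w == 0.0:
            continue  # dataset judged irrelevant to this query context
        present = [g for g in query if g in dataset]
        for gene, profile in dataset.items():
            if gene in query:
                continue
            r = sum(pearson(profile, dataset[q]) for q in present) / len(present)
            totals[gene] = totals.get(gene, 0.0) + w * r
            weight_sums[gene] = weight_sums.get(gene, 0.0) + w
    scores = {g: totals[g] / weight_sums[g] for g in totals}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (invented for illustration, not real expression values).
toy_datasets = [
    # Query genes move together here, so this dataset gets a high weight.
    {"Q1": [1, 2, 3, 4], "Q2": [2, 4, 6, 8], "A": [1, 2, 3, 5], "B": [4, 3, 2, 1]},
    # Query genes are anti-correlated here, so the weight is 0 and it is skipped.
    {"Q1": [1, 2, 3, 4], "Q2": [4, 3, 2, 1], "A": [4, 1, 3, 2], "B": [1, 2, 3, 4]},
]
ranking = search(toy_datasets, ["Q1", "Q2"])
```

On the toy data, only the first dataset contributes to the ranking, so gene `A` (which tracks the query genes there) ranks above gene `B` (which runs opposite to them). Flooring the weight at zero is one simple way to keep datasets in which the query genes disagree from influencing the result; other weighting schemes would be equally consistent with the description above.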
Our query-driven search engine, called SPELL, is available at http://function.princeton.edu/SPELL.
Several additional data files, figures and discussions are available at http://function.princeton.edu/SPELL/supplement.