Mining gene expression databases for association rules.

Author information

Creighton Chad, Hanash Samir

Affiliation

Bioinformatics Program Pediatrics and Communicable Diseases, University of Michigan, Ann Arbor 48109, USA.

Publication information

Bioinformatics. 2003 Jan;19(1):79-86. doi: 10.1093/bioinformatics/19.1.79.

PMID:12499296

Abstract

MOTIVATION

Global gene expression profiling, both at the transcript level and at the protein level, can be a valuable tool in the understanding of genes, biological networks, and cellular states. As larger and larger gene expression data sets become available, data mining techniques can be applied to identify patterns of interest in the data. Association rules, used widely in the area of market basket analysis, can be applied to the analysis of expression data as well. Association rules can reveal biologically relevant associations between different genes or between environmental effects and gene expression. An association rule has the form LHS --> RHS, where LHS and RHS are disjoint sets of items, the RHS set being likely to occur whenever the LHS set occurs. Items in gene expression data can include genes that are highly expressed or repressed, as well as relevant facts describing the cellular environment of the genes (e.g. the diagnosis of a tumor sample from which a profile was obtained).
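The rule form LHS --> RHS described above can be made concrete with a small sketch. The abstract does not say how expression values were turned into items or how rules were scored, so everything here is an assumption: the cutoff `THRESHOLD`, the item naming (`geneA_up`, `geneA_down`), the placeholder gene names, and the use of support and confidence, which are the usual measures in market basket analysis.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical cutoff (not from the source): a log expression ratio at or beyond
# +/- THRESHOLD marks a gene as highly expressed ("up") or repressed ("down").
THRESHOLD = 0.2


def profile_to_items(profile: Dict[str, float],
                     threshold: float = THRESHOLD) -> FrozenSet[str]:
    """Turn one expression profile into a set of items such as 'geneA_up'."""
    items = set()
    for gene, log_ratio in profile.items():
        if log_ratio >= threshold:
            items.add(f"{gene}_up")
        elif log_ratio <= -threshold:
            items.add(f"{gene}_down")
    return frozenset(items)


def rule_stats(transactions: List[FrozenSet[str]],
               lhs: FrozenSet[str],
               rhs: FrozenSet[str]) -> Tuple[float, float]:
    """Return (support, confidence) for the rule lhs --> rhs."""
    if lhs & rhs:
        raise ValueError("LHS and RHS must be disjoint")
    # Profiles containing every LHS item, and those containing LHS and RHS together.
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    support = n_both / len(transactions) if transactions else 0.0
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence


# Made-up profiles with placeholder gene names, for illustration only.
profiles = [
    {"geneA": 0.5, "geneB": 0.4, "geneC": -0.1},
    {"geneA": 0.3, "geneB": 0.25, "geneC": -0.3},
    {"geneA": 0.6, "geneB": 0.0, "geneC": 0.0},
    {"geneA": 0.0, "geneB": 0.5, "geneC": 0.0},
]
transactions = [profile_to_items(p) for p in profiles]
support, confidence = rule_stats(
    transactions, frozenset({"geneA_up"}), frozenset({"geneB_up"})
)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data the LHS item occurs in three of the four profiles and both sides occur together in two, so the rule {geneA_up} --> {geneB_up} has support 0.5 and confidence 2/3. Facts about the cellular environment, such as a diagnosis, could be added to each item set in the same way as the gene items.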

RESULTS

We demonstrate an algorithm for efficiently mining association rules from gene expression data, using the data set of 300 yeast expression profiles from Hughes et al. (2000, Cell, 102, 109-126). Using the algorithm, we find numerous rules in the data. A cursory analysis of some of these rules reveals many associations between certain genes; many of these make sense biologically, while others suggest new hypotheses that may warrant further investigation. In a data set derived from the yeast data set, but with the expression values for each transcript randomly shifted with respect to the experiments, no rules were found, indicating that nearly all of the rules mined from the actual data set are unlikely to have occurred by chance.
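The randomized control mentioned above can be sketched as follows. The abstract does not describe the exact procedure, so this is one possible reading: each transcript's values are rotated by its own random offset across the experiments, which keeps every transcript's distribution of values but breaks the alignment between transcripts. The function name `shift_rows`, the toy matrix, and the seed are all hypothetical; an independent random permutation per transcript would be another reasonable interpretation.

```python
import random
from typing import Dict, List


def shift_rows(matrix: Dict[str, List[float]],
               rng: random.Random) -> Dict[str, List[float]]:
    """Cyclically shift each transcript's values by a random offset.

    `matrix` maps a transcript name to its expression values, one per
    experiment. Each transcript gets its own offset, so values that were
    measured in the same experiment no longer line up across transcripts.
    """
    shifted = {}
    for gene, values in matrix.items():
        if not values:
            shifted[gene] = []
            continue
        offset = rng.randrange(len(values))
        shifted[gene] = values[offset:] + values[:offset]
    return shifted


# Made-up matrix with placeholder transcript names: 3 transcripts x 5 experiments.
toy = {
    "geneA": [0.5, 0.3, 0.6, 0.0, -0.4],
    "geneB": [0.4, 0.25, 0.0, 0.5, -0.3],
    "geneC": [-0.1, -0.3, 0.0, 0.0, 0.7],
}
control = shift_rows(toy, random.Random(0))  # fixed seed only for repeatability

for gene in toy:
    # Every transcript keeps exactly the same set of values, in rotated order.
    print(gene, sorted(control[gene]) == sorted(toy[gene]))
```

Mining the shifted matrix with the same settings as the real one gives a rough baseline for how many rules arise by chance alone.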

AVAILABILITY

An implementation of the algorithm using Microsoft SQL Server with Access 2000 is available at http://dot.ped.med.umich.edu:2000/pub/assoc_rules/assoc_rules.zip. Our results from mining the yeast data set are available at http://dot.ped.med.umich.edu:2000/pub/assoc_rules/yeast_results.zip.
