Hunter L, Taylor R C, Leach S M, Simon R
Center for Computational Pharmacology, Department of Pharmacology, School of Medicine, C236, University of Colorado Health Sciences Center, 4200 E. Ninth Avenue, Denver CO 80206, USA.
Bioinformatics. 2001;17 Suppl 1:S115-22. doi: 10.1093/bioinformatics/17.suppl_1.s115.
Gene expression array technology has made it possible to assay the expression levels of tens of thousands of genes at a time, and large databases of such measurements are currently under construction. One important use of such databases is searching for experiments whose gene expression levels are similar to those of a query, potentially identifying previously unsuspected relationships among cellular states. Such searches depend crucially on the metric used to assess the similarity between pairs of experiments. The complex joint distribution of gene expression levels, particularly their correlational structure and non-normality, makes simple similarity metrics such as Euclidean distance or correlational similarity scores suboptimal for this application. We present a similarity metric for gene expression array experiments that takes this complex joint distribution into account. We provide a computationally tractable approximation to this measure and have implemented a database search tool based on it. We discuss implementation issues and efficiency, and we compare our new metric to other standard metrics.
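
To illustrate why correlation between genes matters when ranking experiments against a query, the following minimal sketch compares the two simple metrics named in the abstract (Euclidean distance and a Pearson correlation score) with a Mahalanobis distance. The Mahalanobis distance is used here only as a familiar stand-in that accounts for covariance; it is not the metric proposed in the paper, and it does not address non-normality. All function names and the toy covariance matrix are assumptions made for illustration.

```python
import numpy as np


def euclidean_distance(x, y):
    """Straight-line distance; treats every gene as independent."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))


def pearson_similarity(x, y):
    """Pearson correlation between two expression profiles (1.0 = identical shape)."""
    return float(np.corrcoef(x, y)[0, 1])


def mahalanobis_distance(x, y, cov):
    """Distance that discounts differences along directions of high covariance.

    `cov` is a gene-by-gene covariance matrix, assumed here to be known
    and invertible (an illustrative assumption, not the paper's approach).
    """
    diff = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


def search(query, experiments, distance):
    """Return experiment indices ordered from most to least similar to `query`."""
    scores = [distance(query, e) for e in experiments]
    return [int(i) for i in np.argsort(scores, kind="stable")]


# Toy example: two genes that are strongly co-regulated.
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
query = np.array([0.0, 0.0])
experiments = np.array([
    [1.0, 1.0],   # differs from the query along the usual co-regulation pattern
    [1.0, -1.0],  # differs against the usual pattern
])

if __name__ == "__main__":
    for e in experiments:
        print(e, euclidean_distance(query, e), mahalanobis_distance(query, e, cov))
    print(search(query, experiments, lambda a, b: mahalanobis_distance(a, b, cov)))
```

In this toy case both experiments are equally far from the query under Euclidean distance, while the covariance-aware distance treats the experiment that breaks the usual co-regulation pattern as much less similar. This is one simple way a metric can reflect correlational structure.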