Horton Paul B, Kiseleva Larisa, Fujibuchi Wataru
Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan.
Genome Inform. 2006;17(2):67-76.
We present a fast algorithm and implementation for computing the Spearman rank correlation (SRC) between a query expression profile and each expression profile in a database of profiles. The algorithm runs in time linear in the size of the profile database, with a very small constant factor, and is designed to handle multiple profile platforms and missing values efficiently. We show that our specialized algorithm and C++ implementation achieve an approximately 100-fold speed-up over a reasonable baseline implementation using Perl hash tables. The program, RaPiDS, is designed for general similarity search rather than classification; nevertheless, to assess the usefulness of SRC as a similarity measure, we evaluate the program as a classifier of normal human cell types based on gene expression. Specifically, we use a k-nearest-neighbor classifier with a t statistic derived from SRC as the similarity measure for profile pairs. We estimate accuracy with a jackknife test on microarray data with manually checked cell type annotation. Preliminary results suggest that the measure is useful for profiles measured under similar conditions (same laboratory and chip platform), with 64% accuracy on 1,685 profiles versus 17.5% for the majority-class classifier, but that it requires improvement when comparing profiles from different experimental series.
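To make the quantities in the abstract concrete, the following is a minimal sketch of a straightforward database scan. It is not the RaPiDS algorithm, only the kind of simple approach such an algorithm would be compared against. The function names (`average_ranks`, `spearman`, `t_statistic`, `search`), the dictionary-based profile layout, and the use of `None` for missing values are all assumptions made for illustration. The abstract does not state which t statistic is derived from SRC; the sketch uses the standard textbook form t = r * sqrt((n - 2) / (1 - r^2)), where n is the number of genes measured in both profiles.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(query, profile):
    """SRC over genes measured in both profiles (None marks a missing value).

    Returns (rho, n); rho is None when the correlation is undefined.
    """
    shared = [g for g in query
              if g in profile and query[g] is not None and profile[g] is not None]
    n = len(shared)
    if n < 3:
        return None, n
    rx = average_ranks([query[g] for g in shared])
    ry = average_ranks([profile[g] for g in shared])
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None, n  # a constant profile has no defined correlation
    return cov / math.sqrt(vx * vy), n


def t_statistic(rho, n):
    """Textbook t statistic for a correlation rho over n paired values (assumed form)."""
    if abs(rho) >= 1.0:
        return math.copysign(math.inf, rho)
    return rho * math.sqrt((n - 2) / (1.0 - rho * rho))


def search(query, database):
    """Score every database profile against the query, best match first."""
    hits = []
    for name, profile in database.items():
        rho, n = spearman(query, profile)
        if rho is not None:
            hits.append((name, rho, t_statistic(rho, n)))
    return sorted(hits, key=lambda h: h[2], reverse=True)
```

Ranking by the t statistic rather than by the raw correlation takes into account how many genes two profiles share, which matters when profiles come from different platforms or contain missing values. A k-nearest-neighbor classifier of the kind described above could then vote over the labels of the top k hits returned by `search`.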