Mao Rui, Xu Weijia, Ramakrishnan Smriti, Nuckolls Glen, Miranker Daniel P
Department of Computer Sciences, Center for Computational Biology and Bioinformatics, University of Texas at Austin, 1 University Station C0500, Austin, TX 78712-0233, USA.
Proc IEEE Comput Syst Bioinform Conf. 2005:351-61. doi: 10.1109/csb.2005.42.
Similarity search leveraging distance-based index structures is increasingly being used for both multimedia and biological database applications. We consider distance-based indexing for three important biological data types, protein k-mers with the metric PAM model, DNA k-mers with Hamming distance and peptide fragmentation spectra with a pseudo-metric derived from cosine distance. To date, the primary driver of this research has been multimedia applications, where similarity functions are often Euclidean norms on high dimensional feature vectors. We develop results showing that the character of these biological workloads is different from multimedia workloads. In particular, they are not intrinsically very high dimensional, and deserving different optimization heuristics. Based on MVP-trees, we develop a pivot selection heuristic seeking centers and show it outperforms the most widely used corner seeking heuristic. Similarly, we develop a data partitioning approach sensitive to the actual data distribution in lieu of median splits.
利用基于距离的索引结构进行相似性搜索越来越多地应用于多媒体和生物数据库应用中。我们考虑对三种重要的生物数据类型进行基于距离的索引,即使用PAM模型度量的蛋白质k聚体、使用汉明距离的DNA k聚体以及使用从余弦距离导出的伪度量的肽片段谱。迄今为止,这项研究的主要驱动力一直是多媒体应用,在多媒体应用中,相似性函数通常是高维特征向量上的欧几里得范数。我们得出的结果表明,这些生物工作负载的特性与多媒体工作负载不同。特别是,它们本质上不是非常高维的,因此需要不同的优化启发式方法。基于MVP树,我们开发了一种寻找中心的枢轴选择启发式方法,并表明它优于最广泛使用的寻找角点启发式方法。同样,我们开发了一种对实际数据分布敏感的数据分区方法,以代替中位数分割。