SketchSort：快速全对相似性搜索大型分子指纹数据库。

SketchSort: Fast All Pairs Similarity Search for Large Databases of Molecular Fingerprints.

机构信息

Minato Discrete Structure Manipulation System Project, ERATO, Japan Science and Technology Agency, Sapporo, 060-0814, Japan.

Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, 135-0064, Japan.

出版信息

Mol Inform. 2011 Sep;30(9):801-7. doi: 10.1002/minf.201100050. Epub 2011 Jul 12.

DOI:10.1002/minf.201100050

PMID:27467412

Abstract

Similarity networks of ligands are often reported useful in predicting chemical activities and target proteins. However, the naive method of computing all pairwise similarities of chemical fingerprints takes quadratic time, which is prohibitive for large scale databases with millions of ligands. We propose a fast all pairs similarity search method, called SketchSort, that maps chemical fingerprints to symbol strings with random projections, and finds similar strings by multiple masked sorting. Due to random projection, SketchSort misses a certain fraction of neighbors (i.e., false negatives). Nevertheless, the expected fraction of false negatives is theoretically derived and can be kept under a very small value. Experiments show that SketchSort is much faster than other similarity search methods and enables us to obtain a PubChem-scale similarity network quickly.

摘要

配体的相似性网络通常被报道在预测化学活性和靶蛋白方面很有用。然而，计算化学指纹的所有成对相似性的简单方法需要二次时间，对于具有数百万配体的大规模数据库来说是不可行的。我们提出了一种快速的全对相似性搜索方法，称为 SketchSort，它将化学指纹映射到具有随机投影的符号串，并通过多次掩蔽排序找到相似的字符串。由于随机投影，SketchSort 会错过一定比例的邻居（即假阴性）。然而，假阴性的预期比例是从理论上推导出来的，可以保持在非常小的值。实验表明，SketchSort 比其他相似性搜索方法快得多，使我们能够快速获得 PubChem 规模的相似性网络。