Li Weizhong
Burnham Institute for Medical Research, 10901 N. Torrey Pines Rd., La Jolla, California 92037, USA.
J Chem Inf Model. 2006 Sep-Oct;46(5):1919-23. doi: 10.1021/ci0600859.
As a result of recent developments in high-throughput screening for drug discovery, the number of available screening compounds has been growing rapidly. Chemical vendors provide millions of compounds; however, these compounds are highly redundant. Clustering analysis, a technique that groups similar compounds into families, can be used to analyze such redundancy. Many available clustering methods focus on accurate classification of compounds; they are slow and not suitable for very large compound libraries. This paper describes a fast clustering method based on an incremental clustering algorithm and the 2D fingerprints of compounds. The method can cluster a very large data set containing millions of compounds in hours on a single computer. A program that implements this method, called cd-hit-fp, is available from http://chemspace.org.
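To illustrate the general idea of incremental clustering on 2D fingerprints, the following is a minimal sketch and not the cd-hit-fp implementation. It assumes fingerprints are given as sets of on-bit indices, that similarity is measured with the Tanimoto coefficient, and that a compound joins the first cluster whose representative it matches at or above a threshold. The similarity measure, the threshold value, the processing order, and the names `tanimoto` and `incremental_cluster` are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Dict, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def incremental_cluster(
    fingerprints: Dict[str, Set[int]],
    threshold: float = 0.8,  # hypothetical cutoff, not taken from the paper
) -> Dict[str, List[str]]:
    """Greedy incremental clustering.

    Compounds are visited once, in input order (an assumption). Each compound
    is compared only against existing cluster representatives. It joins the
    first representative it matches; otherwise it founds a new cluster.
    """
    clusters: Dict[str, List[str]] = {}
    for name, fp in fingerprints.items():
        for rep in clusters:
            if tanimoto(fp, fingerprints[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters
```

Because each compound is compared only with cluster representatives rather than with every other compound, the number of comparisons grows with the number of clusters instead of with the square of the library size, which is the kind of saving an incremental scheme relies on.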