A fast clustering algorithm for analyzing highly similar compounds of very large libraries.

Author information

Li Weizhong

Affiliation

Burnham Institute for Medical Research, 10901 N. Torrey Pines Rd., La Jolla, California 92037, USA.

Publication information

J Chem Inf Model. 2006 Sep-Oct;46(5):1919-23. doi: 10.1021/ci0600859.

DOI:10.1021/ci0600859

PMID:16995722

Abstract

As a result of the recent developments of high-throughput screening in drug discovery, the number of available screening compounds has been growing rapidly. Chemical vendors provide millions of compounds; however, these compounds are highly redundant. Clustering analysis, a technique that groups similar compounds into families, can be used to analyze such redundancy. Many available clustering methods focus on accurate classification of compounds; they are slow and are not suitable for very large compound libraries. Here is described a fast clustering method based on an incremental clustering algorithm and the 2D fingerprints of compounds. This method can cluster a very large data set with millions of compounds in hours on a single computer. A program implemented with this method, called cd-hit-fp, is available from http://chemspace.org.
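The abstract names two ingredients: an incremental clustering algorithm and 2D fingerprints. The following is a minimal sketch of how such a greedy, single-pass scheme can work. It is not the cd-hit-fp implementation. Several details are assumptions made for illustration only: fingerprints are represented as sets of on-bit indices, similarity is measured with the Tanimoto coefficient, the threshold of 0.9 is arbitrary, and the function names `tanimoto` and `incremental_cluster` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def incremental_cluster(
    fingerprints: Sequence[FrozenSet[int]],
    threshold: float = 0.9,  # assumed cutoff, not taken from the paper
) -> Dict[int, List[int]]:
    """Greedy incremental clustering in a single pass over the input.

    Each compound is compared against the representatives of the clusters
    found so far. It joins the first cluster whose representative is at least
    `threshold` similar; otherwise it becomes the representative of a new
    cluster. Returns a mapping of representative index to member indices.
    """
    clusters: Dict[int, List[int]] = {}
    for i, fp in enumerate(fingerprints):
        for rep in clusters:
            if tanimoto(fp, fingerprints[rep]) >= threshold:
                clusters[rep].append(i)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[i] = [i]
    return clusters


if __name__ == "__main__":
    toy = [
        frozenset(range(1, 11)),   # compound 0
        frozenset(range(1, 12)),   # shares 10 of 11 bits with compound 0
        frozenset({20, 21, 22}),   # unrelated to compound 0
        frozenset({20, 21, 22}),   # identical to compound 2
        frozenset({1, 2, 3}),      # only 3 of 10 bits shared with compound 0
    ]
    print(incremental_cluster(toy, threshold=0.9))
```

Because each compound is compared only with cluster representatives rather than with every other compound, the number of comparisons scales with the number of clusters instead of the square of the library size, which is the general reason incremental schemes suit highly redundant data sets.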

Similar articles

A hierarchical clustering approach for large compound libraries.

J Chem Inf Model. 2005 Jul-Aug;45(4):807-15. doi: 10.1021/ci0500029.

An algorithm for clustering cDNA fingerprints.

Genomics. 2000 Jun 15;66(3):249-56. doi: 10.1006/geno.2000.6187.

Analysis of a Gibbs sampler method for model-based clustering of gene expression data.

Bioinformatics. 2008 Jan 15;24(2):176-83. doi: 10.1093/bioinformatics/btm562. Epub 2007 Nov 22.

Efficient layered density-based clustering of categorical data.

J Biomed Inform. 2009 Apr;42(2):365-76. doi: 10.1016/j.jbi.2008.11.004. Epub 2008 Dec 10.

Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach.

Bioinformatics. 2007 Jul 1;23(13):1607-15. doi: 10.1093/bioinformatics/btm158. Epub 2007 May 5.

SeleX-CS: a new consensus scoring algorithm for hit discovery and lead optimization.

J Chem Inf Model. 2009 Mar;49(3):623-33. doi: 10.1021/ci800335j.

A fast, fully automated cell segmentation algorithm for high-throughput and high-content screening.

Cytometry A. 2008 Oct;73(10):958-64. doi: 10.1002/cyto.a.20627.

NIPALSTREE: a new hierarchical clustering approach for large compound libraries and its application to virtual screening.

J Chem Inf Model. 2006 Nov-Dec;46(6):2220-9. doi: 10.1021/ci050541d.

Divisive Correlation Clustering Algorithm (DCCA) for grouping of genes: detecting varying patterns in expression profiles.

Bioinformatics. 2008 Jun 1;24(11):1359-66. doi: 10.1093/bioinformatics/btn133. Epub 2008 Apr 10.

Cited by

BPA: a BERT-based priority annotation strategy for assessing the rationality of aquatic algal protein sequences.

Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf401.

CFam: a chemical families database based on iterative selection of functional seeds and seed-directed compound clustering.

Nucleic Acids Res. 2015 Jan;43(Database issue):D558-65. doi: 10.1093/nar/gku1212. Epub 2014 Nov 20.

Identification and Preclinical Pharmacology of the γ-Secretase Modulator BMS-869780.

Int J Alzheimers Dis. 2014;2014:431858. doi: 10.1155/2014/431858. Epub 2014 Jul 8.

Discovery of a small-molecule antiviral targeting the HIV-1 matrix protein.

Bioorg Med Chem Lett. 2013 Feb 15;23(4):1132-5. doi: 10.1016/j.bmcl.2012.11.041. Epub 2012 Nov 29.

Transcription Factor DLX5 As a New Target for Promising Antitumor Agents.

Acta Naturae. 2011 Jul;3(3):47-51.

Structure-based drug design of a new chemical class of small molecules active against influenza A nucleoprotein in vitro and in vivo.

PLoS Curr. 2011 Aug 7;3:RRN1253. doi: 10.1371/currents.RRN1253.

3-D clustering: a tool for high throughput docking.

J Mol Model. 2009 May;15(5):551-60. doi: 10.1007/s00894-008-0360-6. Epub 2008 Dec 16.

Counting clusters using R-NN curves.

J Chem Inf Model. 2007 Jul-Aug;47(4):1308-18. doi: 10.1021/ci600541f. Epub 2007 Jun 30.
