Thiel Philipp, Sach-Peltason Lisa, Ottmann Christian, Kohlbacher Oliver
Applied Bioinformatics, Center for Bioinformatics, Quantitative Biology Center and Dept. of Computer Science, University of Tübingen, Sand 14, 72076 Tübingen, Germany.
J Chem Inf Model. 2014 Sep 22;54(9):2395-401. doi: 10.1021/ci500150t. Epub 2014 Sep 2.
The calculation of pairwise compound similarities based on fingerprints is one of the fundamental tasks in chemoinformatics. Methods for efficient calculation of compound similarities are of the utmost importance for various applications like similarity searching or library clustering. With the increasing size of public compound databases, exact clustering of these databases is desirable, but often computationally prohibitively expensive. We present an optimized inverted index algorithm for the calculation of all pairwise similarities on 2D fingerprints of a given data set. In contrast to other algorithms, it neither requires GPU computing nor yields a stochastic approximation of the clustering. The algorithm has been designed to work well with multicore architectures and shows excellent parallel speedup. As an application example of this algorithm, we implemented a deterministic clustering application, which has been designed to decompose virtual libraries comprising tens of millions of compounds in a short time on current hardware. Our results show that our implementation achieves more than 400 million Tanimoto similarity calculations per second on a common desktop CPU. Deterministic clustering of the available chemical space thus can be done on modern multicore machines within a few days.
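To make the inverted index idea concrete, here is a minimal single-threaded sketch in Python. It is not the authors' optimized implementation. The function names (`build_inverted_index`, `pairwise_tanimoto`), the representation of fingerprints as sets of on-bit positions, and the `threshold` parameter are all assumptions made for illustration. The sketch maps each bit to the compounds that set it, counts shared bits per compound pair by walking those posting lists, and derives the Tanimoto coefficient as c / (|A| + |B| - c), where c is the number of shared bits.

```python
from collections import defaultdict


def build_inverted_index(fingerprints):
    """Map each on-bit position to the list of compound ids that set it."""
    index = defaultdict(list)
    for cid, bits in enumerate(fingerprints):
        for bit in bits:
            index[bit].append(cid)
    return index


def pairwise_tanimoto(fingerprints, threshold=0.0):
    """Return {(i, j): similarity} for all pairs i < j sharing at least one bit.

    Hypothetical helper: fingerprints are iterables of on-bit positions.
    Pairs without a common bit have similarity 0 and are never enumerated.
    """
    fps = [set(bits) for bits in fingerprints]  # make sure bits are unique
    sizes = [len(bits) for bits in fps]
    index = build_inverted_index(fps)

    result = {}
    for i, bits in enumerate(fps):
        # Count, for every later compound j, how many bits it shares with i.
        shared = defaultdict(int)
        for bit in bits:
            for j in index[bit]:
                if j > i:
                    shared[j] += 1
        # Tanimoto = shared / (|A| + |B| - shared)
        for j, common in shared.items():
            similarity = common / (sizes[i] + sizes[j] - common)
            if similarity >= threshold:
                result[(i, j)] = similarity
    return result


if __name__ == "__main__":
    toy = [{1, 2, 3}, {2, 3, 4}, {5, 6}, {1, 2, 3}]
    for pair, sim in sorted(pairwise_tanimoto(toy).items()):
        print(pair, round(sim, 3))
```

In this sketch each iteration of the outer loop only reads the shared index and writes its own results, so the iterations could in principle be distributed across cores. How the published algorithm actually partitions the work is not described in the abstract.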