Clustering 100,000 protein structure decoys in minutes.

Affiliation

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2012 May-Jun;9(3):765-73. doi: 10.1109/TCBB.2011.142.

DOI:10.1109/TCBB.2011.142

PMID:22025764

Abstract

Ab initio protein structure prediction methods first generate large sets of candidate structural conformations (called decoys) and then select the most representative decoys by clustering. Classical clustering methods are inefficient because they compute pairwise distances, and they become infeasible when the number of decoys is large. In addition, existing clustering approaches choose the distance threshold for proteins within a cluster arbitrarily: a small threshold produces many small clusters, while a large threshold merges several independent clusters into one. In this paper, we propose an efficient clustering method based on fast estimation of cluster centroids and efficient pruning of rotation spaces. The number of clusters is detected automatically using information distance criteria. The method is implemented in a freely downloadable package named ONION. Experimental results on benchmark data sets suggest that ONION is 14 times faster than existing tools, and that its selections are better than SPICKER’s for 31 targets and worse for 19 targets. On an average PC, ONION can cluster 100,000 decoys in around 12 minutes.
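
To make the scaling argument in the abstract concrete, the following minimal sketch counts how many structure comparisons an all-against-all method needs for 100,000 decoys, next to a generic scheme that compares each decoy only against a small set of estimated centroids. This is not ONION's algorithm: the function names, the number of centroids `k`, and the number of passes `iterations` are hypothetical values chosen purely for illustration.

```python
def pairwise_comparisons(n: int) -> int:
    """Distinct decoy pairs an all-against-all clustering method must score."""
    return n * (n - 1) // 2


def centroid_comparisons(n: int, k: int, iterations: int) -> int:
    """Comparisons when each decoy is scored against k centroids on every pass.

    k and iterations are illustrative assumptions, not values from the paper.
    """
    return n * k * iterations


n_decoys = 100_000
all_pairs = pairwise_comparisons(n_decoys)
centroid_based = centroid_comparisons(n_decoys, k=10, iterations=20)

print(f"all-against-all comparisons: {all_pairs:,}")
print(f"centroid-based comparisons:  {centroid_based:,}")
```

Under these assumed settings the pairwise count is close to five billion, while the centroid-based count is twenty million. The actual speedup of any real tool depends on how it estimates centroids and how expensive each structural comparison is.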
