Department of Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65201, USA.
Proteomics. 2013 Jan;13(2):221-9. doi: 10.1002/pmic.201200334. Epub 2013 Jan 3.
De novo protein structure prediction often generates a large population of candidates (models) and then selects near-native models through clustering. Existing structural model clustering methods are time-consuming because they calculate pairwise distances between models. In this paper, we present a novel method for fast model clustering that does not sacrifice clustering accuracy. Instead of the commonly used pairwise root mean square deviation (RMSD) and TM-score values, we propose two new distance measures, Dscore1 and Dscore2, based on comparison of the protein distance matrices, to describe the difference and the similarity among models, respectively. Our analysis indicates that both the correlation between Dscore1 and RMSD and the correlation between Dscore2 and TM-score are high. Compared with existing methods, whose calculation time is quadratic in the number of models, our Dscore1-based clustering achieves linear time complexity while obtaining almost the same accuracy for near-native model selection. By using Dscore2 to select representatives of clusters, we can further improve the quality of the representatives with little increase in computing time. In addition, for large model sets (~500 k models), we can produce a fast data visualization based on the Dscore distribution in seconds to minutes. Our method has been implemented in a package named MUFOLD-CL, available at http://mufold.org/clustering.php.
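To illustrate why comparing distance matrices can avoid the quadratic cost, the following is a minimal sketch rather than the MUFOLD-CL implementation. The abstract does not give the formulas for Dscore1 or Dscore2, so the measure here is a stand-in: a plain root mean square difference between the intra-model distance matrices. The function names `distance_vector`, `dmatrix_difference`, and `rank_against_mean`, and the idea of scoring every model against the mean distance matrix, are assumptions made for illustration. Because a distance matrix is unchanged by rotation and translation, no superposition is needed, and scoring N models against one reference takes N comparisons instead of N(N-1)/2.

```python
import math
from itertools import combinations


def distance_vector(coords):
    """Flatten the upper triangle of a model's intra-model distance matrix.

    coords is a list of (x, y, z) tuples, e.g. one C-alpha atom per residue.
    """
    return [math.dist(p, q) for p, q in combinations(coords, 2)]


def dmatrix_difference(da, db):
    """Root mean square difference between two distance vectors.

    Illustrative stand-in only; this is NOT the published Dscore1 formula.
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))


def rank_against_mean(models):
    """Score each model against the mean distance vector of the whole set.

    One pass builds the mean and one pass scores the models, so the number
    of model-to-reference comparisons grows linearly with len(models).
    Returns (model indices sorted from closest to farthest, raw scores).
    """
    vectors = [distance_vector(m) for m in models]
    n = len(vectors)
    mean_vec = [sum(col) / n for col in zip(*vectors)]
    scores = [dmatrix_difference(v, mean_vec) for v in vectors]
    order = sorted(range(n), key=scores.__getitem__)
    return order, scores


# Hypothetical toy data: three 3-residue "models".
model_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
model_b = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)]  # model_a rotated
model_c = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]  # stretched outlier

order, scores = rank_against_mean([model_a, model_b, model_c])
```

In this toy set, `model_a` and `model_b` differ only by a rotation, so their distance vectors are identical, while the stretched `model_c` is ranked farthest from the mean. A real pipeline would still need a clustering step on top of such scores; this sketch only shows the linear-cost comparison against a single reference.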