Anandakrishnan Ramu, Onufriev Alexey
Department of Computer Science, Virginia Tech, Blacksburg, Virginia 24061, USA.
J Comput Biol. 2008 Mar;15(2):165-84. doi: 10.1089/cmb.2007.0144.
In statistical mechanics, the equilibrium properties of a physical system of particles can be calculated as the statistical average over accessible microstates of the system. In general, these calculations are computationally intractable since they involve summations over an exponentially large number of microstates. Clustering algorithms are one of the methods used to numerically approximate these sums. The most basic clustering algorithms first sub-divide the system into a set of smaller subsets (clusters). Then, interactions between particles within each cluster are treated exactly, while all interactions between different clusters are ignored. These smaller clusters have far fewer microstates, making the summation over these microstates, tractable. These algorithms have been previously used for biomolecular computations, but remain relatively unexplored in this context. Presented here, is a theoretical analysis of the error and computational complexity for the two most basic clustering algorithms that were previously applied in the context of biomolecular electrostatics. We derive a tight, computationally inexpensive, error bound for the equilibrium state of a particle computed via these clustering algorithms. For some practical applications, it is the root mean square error, which can be significantly lower than the error bound, that may be more important. We how that there is a strong empirical relationship between error bound and root mean square error, suggesting that the error bound could be used as a computationally inexpensive metric for predicting the accuracy of clustering algorithms for practical applications. An example of error analysis for such an application-computation of average charge of ionizable amino-acids in proteins-is given, demonstrating that the clustering algorithm can be accurate enough for practical purposes.
在统计力学中,粒子物理系统的平衡性质可计算为该系统可及微观状态的统计平均值。一般来说,这些计算在计算上难以处理,因为它们涉及对指数级大量微观状态的求和。聚类算法是用于对这些求和进行数值近似的方法之一。最基本的聚类算法首先将系统细分为一组较小的子集(簇)。然后,精确处理每个簇内粒子之间的相互作用,而忽略不同簇之间的所有相互作用。这些较小的簇具有少得多的微观状态,使得对这些微观状态的求和变得易于处理。这些算法先前已用于生物分子计算,但在此背景下仍相对未被探索。本文给出了对先前应用于生物分子静电学背景下的两种最基本聚类算法的误差和计算复杂度的理论分析。我们为通过这些聚类算法计算的粒子平衡态导出了一个紧密的、计算成本低的误差界。对于一些实际应用,可能更重要的是均方根误差,它可能显著低于误差界。我们表明误差界和均方根误差之间存在很强的经验关系,这表明误差界可作为一种计算成本低的度量,用于预测聚类算法在实际应用中的准确性。给出了此类应用——计算蛋白质中可电离氨基酸的平均电荷——的误差分析示例,表明聚类算法在实际应用中可以足够准确。