Department of Psychology, Universität der Bundeswehr München, Werner-Heisenberg-Weg 39, Neubiberg, 85577, Germany.
Max Planck Institute for Human Development, Department for Lifespan Psychology, Lentzeallee 94, Berlin, 14195, Germany.
BMC Bioinformatics. 2018 Oct 12;19(1):375. doi: 10.1186/s12859-018-2359-z.
Bayesian clustering algorithms, in particular those utilizing Dirichlet Processes (DP), return a sample of the posterior distribution over partitions of a set. However, in many applied cases a single clustering solution is desired, which requires deriving a 'best' partition from the posterior sample. Which solution should be recommended in which situation remains an open research question. One candidate is the sample mean, defined as the clustering with minimal squared distance to all partitions in the posterior sample, weighted by their probability. In this article, we review an algorithm that approximates this sample mean by using the Hungarian Method to compute the distance between partitions. This algorithm leaves room for further acceleration.
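The distance between two partitions used here can be reduced to an assignment problem: count the pairwise overlaps between clusters, find the one-to-one cluster matching that preserves the most elements, and take the remaining elements as the distance. The sketch below illustrates this reduction; it is not the paper's implementation, and for brevity it solves the assignment by brute force over matchings rather than with the Hungarian Method, which only changes the runtime, not the result. The class and method names are illustrative.

```java
public class PartitionDistance {
    // Distance between two partitions of {0,...,n-1}, each given as an array of
    // cluster labels per element: the minimum number of elements that must be
    // moved to transform one partition into the other. It equals n minus the
    // value of an optimal one-to-one matching between clusters.
    static int distance(int[] a, int[] b, int kA, int kB) {
        int n = a.length;
        int k = Math.max(kA, kB);                 // pad to a square overlap matrix
        int[][] overlap = new int[k][k];
        for (int i = 0; i < n; i++) overlap[a[i]][b[i]]++;
        return n - bestAssignment(overlap, new boolean[k], 0);
    }

    // Maximum total overlap over all one-to-one cluster matchings.
    // A Hungarian Method solver would replace this exponential search.
    static int bestAssignment(int[][] m, boolean[] used, int row) {
        if (row == m.length) return 0;
        int best = 0;
        for (int col = 0; col < m.length; col++) {
            if (used[col]) continue;
            used[col] = true;
            best = Math.max(best, m[row][col] + bestAssignment(m, used, row + 1));
            used[col] = false;
        }
        return best;
    }

    public static void main(String[] args) {
        // Two partitions of a 6-element set: {0,1,2},{3,4},{5} vs {0,1},{2,3},{4,5}.
        int[] p = {0, 0, 0, 1, 1, 2};
        int[] q = {0, 0, 1, 1, 2, 2};
        System.out.println(distance(p, q, 3, 3)); // prints 2: move elements 2 and 4
    }
}
```

The sample mean sought by the algorithm is then the partition minimizing the probability-weighted sum of squared values of this distance over the posterior sample.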
We highlight a faster variant of the partition distance reduction that leads to a runtime complexity up to two orders of magnitude lower than that of the standard variant. We suggest two further improvements: The first is deterministic and based on an adapted dynamic version of the Hungarian Algorithm, which achieves another runtime decrease of at least one order of magnitude. The second improvement is theoretical and uses Monte Carlo techniques and the dynamic matrix inverse. Thereby we further reduce the runtime complexity by nearly the square root of one order of magnitude.
Overall, this results in a new mean partition algorithm whose acceleration factor exceeds that of the present algorithm by a factor on the order of the partition size. The new algorithm is implemented in Java and available on GitHub (Glassen, Mean Partition, 2018).