Strouse D J, Schwab David J
Department of Physics, Princeton University, Princeton, NJ 08544, U.S.A.
Department of Physics, Northwestern University, Evanston, IL 60208, U.S.A.
Neural Comput. 2017 Jun;29(6):1611-1630. doi: 10.1162/NECO_a_00961. Epub 2017 Apr 14.
Lossy compression and clustering fundamentally involve a decision about which features are relevant and which are not. The information bottleneck method (IB) by Tishby, Pereira, and Bialek ( 1999 ) formalized this notion as an information-theoretic optimization problem and proposed an optimal trade-off between throwing away as many bits as possible and selectively keeping those that are most important. In the IB, compression is measured by mutual information. Here, we introduce an alternative formulation that replaces mutual information with entropy, which we call the deterministic information bottleneck (DIB) and argue better captures this notion of compression. As suggested by its name, the solution to the DIB problem turns out to be a deterministic encoder, or hard clustering, as opposed to the stochastic encoder, or soft clustering, that is optimal under the IB. We compare the IB and DIB on synthetic data, showing that the IB and DIB perform similarly in terms of the IB cost function, but that the DIB significantly outperforms the IB in terms of the DIB cost function. We also empirically find that the DIB offers a considerable gain in computational efficiency over the IB, over a range of convergence parameters. Our derivation of the DIB also suggests a method for continuously interpolating between the soft clustering of the IB and the hard clustering of the DIB.
有损压缩和聚类从根本上涉及到一个关于哪些特征相关、哪些特征不相关的决策。Tishby、Pereira和Bialek(1999)提出的信息瓶颈方法(IB)将这一概念形式化为一个信息论优化问题,并提出了在尽可能丢弃更多比特与有选择地保留最重要比特之间的最优权衡。在IB中,压缩通过互信息来衡量。在这里,我们引入一种替代公式,用熵取代互信息,我们将其称为确定性信息瓶颈(DIB),并认为它能更好地捕捉这种压缩概念。顾名思义,DIB问题的解决方案是一个确定性编码器,即硬聚类,这与在IB下最优的随机编码器(即软聚类)形成对比。我们在合成数据上比较了IB和DIB,结果表明,在IB成本函数方面,IB和DIB的表现相似,但在DIB成本函数方面,DIB明显优于IB。我们还通过实验发现,在一系列收敛参数范围内,DIB在计算效率上比IB有显著提升。我们对DIB的推导还提出了一种在IB的软聚类和DIB的硬聚类之间进行连续插值的方法。