Faculty of Electrical Engineering, Czech Technical University, Karlovo náměstí 13, 121 35 Prague, Czech Republic.
IEEE Trans Pattern Anal Mach Intell. 2010 Feb;32(2):371-7. doi: 10.1109/TPAMI.2009.166.
We propose a randomized data mining method that finds clusters of spatially overlapping images. The core of the method relies on the min-Hash algorithm for fast detection of pairs of images with spatial overlap, the so-called cluster seeds. The seeds are then used as visual queries to obtain clusters, which are formed as transitive closures of sets of partially overlapping images that include the seed. We show that the probability of finding a seed for an image cluster rapidly increases with the size of the cluster. The properties and performance of the algorithm are demonstrated on data sets with 10^4, 10^5, and 5 × 10^6 images. The speed of the method depends on the size of the database and the number of clusters. The first stage, seed generation, scales close to linearly for database sizes up to approximately 2^34 ≈ 10^10 images. On a single 2.4 GHz PC, the clustering process took only 24 minutes for a standard database of more than 100,000 images, i.e., only 0.014 seconds per image.
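The seed-generation stage described in the abstract can be illustrated with a minimal sketch. It assumes each image is already represented as a set of integer visual-word IDs, which the abstract does not specify. The names `find_seeds`, `sketch_size`, and `num_sketches`, the linear hash family, and all parameter values are hypothetical choices for illustration, not the authors' implementation. The function returns candidate pairs only; using the seeds as visual queries to grow clusters is not shown. The helper `collision_probability` states the standard min-Hash property that two sets with similarity `sim` share at least one of `num_sketches` sketches of size `sketch_size` with probability 1 - (1 - sim^sketch_size)^num_sketches.

```python
import random
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, modulus of the hash family


def make_hash_funcs(n, seed=0):
    """Draw n random linear hash functions h(w) = (a*w + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def min_hash(words, func):
    """Smallest hash value over a set of visual-word IDs (all IDs < PRIME)."""
    a, b = func
    return min((a * w + b) % PRIME for w in words)


def find_seeds(images, sketch_size=2, num_sketches=64, seed=0):
    """Return candidate seed pairs: images sharing at least one sketch.

    images: dict mapping image ID -> set of integer visual-word IDs
            (an assumed representation, hypothetical for this sketch).
    """
    funcs = make_hash_funcs(sketch_size * num_sketches, seed)
    candidates = set()
    for t in range(num_sketches):
        group = funcs[t * sketch_size:(t + 1) * sketch_size]
        buckets = defaultdict(list)  # one hash table per sketch
        for image_id, words in images.items():
            if words:
                key = tuple(min_hash(words, f) for f in group)
                buckets[key].append(image_id)
        # Every pair of images falling into the same bucket is a candidate.
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates


def collision_probability(sim, sketch_size, num_sketches):
    """Chance that two sets with similarity sim share at least one sketch."""
    return 1.0 - (1.0 - sim ** sketch_size) ** num_sketches


# Toy data: "a" and "b" have identical word sets, "c" shares nothing.
images = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 4}, "c": {100, 200, 300}}
seeds = find_seeds(images)
```

Each pass over the database fills one hash table, so the work in this sketch grows linearly with the number of images as long as the buckets stay small. Pairs with low similarity are found only rarely, but a large cluster contains many pairs, which is in line with the abstract's statement that the chance of finding at least one seed rises quickly with cluster size.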