Mann Willi, Augsten Nikolaus, Jensen Christian S, Pawlik Mateusz
Celonis SE, Munich, Germany.
University of Salzburg, Salzburg, Austria.
VLDB J. 2025;34(1):13. doi: 10.1007/s00778-024-00880-x. Epub 2024 Dec 23.
We provide efficient support for applications that aim to continuously find pairs of similar sets in rapid streams, such as Twitter streams that emit tweets as sets of words. Using a sliding window model, the top- result changes as new sets enter the window or existing ones leave the window. Specifically, when a set arrives, it may form a new top- result pair with any set already in the window. When a set leaves the window, all its pairings in the top- result must be replaced with other pairs. It is therefore not sufficient to maintain the most similar pairs since less similar pairs may become top- pairs later. We propose SWOOP, a highly scalable stream join algorithm. Novel indexing techniques and sophisticated filters efficiently prune obsolete pairs as new sets enter the window. SWOOP incrementally maintains a provably minimal stock of similar pairs to update the top- result at any time. Empirical studies confirm that SWOOP is able to support stream rates that are orders of magnitude faster than the rates supported by existing approaches.
我们为旨在快速流中持续查找相似集对的应用程序提供高效支持,例如将推文作为单词集发出的推特流。使用滑动窗口模型,随着新集进入窗口或现有集离开窗口,顶级结果会发生变化。具体而言,当一个集到达时,它可能会与窗口中已有的任何集形成一个新的顶级结果对。当一个集离开窗口时,其在顶级结果中的所有配对都必须被其他对替换。因此,仅维护最相似的对是不够的,因为不太相似的对可能稍后会成为顶级对。我们提出了SWOOP,一种高度可扩展的流连接算法。新颖的索引技术和复杂的过滤器在新集进入窗口时有效地修剪过时的对。SWOOP增量地维护一组可证明最小的相似对,以便随时更新顶级结果。实证研究证实,SWOOP能够支持比现有方法支持的速率快几个数量级的流速率。