SWOOP: top-k similarity joins over set streams.

Authors

Mann Willi, Augsten Nikolaus, Jensen Christian S, Pawlik Mateusz

Affiliations

Celonis SE, Munich, Germany.

University of Salzburg, Salzburg, Austria.

Publication information

VLDB J. 2025;34(1):13. doi: 10.1007/s00778-024-00880-x. Epub 2024 Dec 23.

DOI: 10.1007/s00778-024-00880-x
PMID: 39723165
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11666680/
Abstract

We provide efficient support for applications that aim to continuously find pairs of similar sets in rapid streams, such as Twitter streams that emit tweets as sets of words. Using a sliding window model, the top-k result changes as new sets enter the window or existing ones leave the window. Specifically, when a set arrives, it may form a new top-k result pair with any set already in the window. When a set leaves the window, all its pairings in the top-k result must be replaced with other pairs. It is therefore not sufficient to maintain the k most similar pairs since less similar pairs may become top-k pairs later. We propose SWOOP, a highly scalable stream join algorithm. Novel indexing techniques and sophisticated filters efficiently prune obsolete pairs as new sets enter the window. SWOOP incrementally maintains a provably minimal stock of similar pairs to update the top-k result at any time. Empirical studies confirm that SWOOP is able to support stream rates that are orders of magnitude faster than the rates supported by existing approaches.
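
To make the problem statement concrete, the following is a minimal brute-force sketch of a top-k similarity join over a sliding window. It is not SWOOP and uses none of its indexing or filtering techniques; it simply recomputes all pairs on every arrival. The Jaccard similarity, the count-based window, and all names (`jaccard`, `NaiveTopKWindowJoin`, `insert`, `top_k`) are assumptions chosen for illustration, since the abstract specifies neither the similarity function nor the window type.

```python
from collections import deque
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed similarity function)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


class NaiveTopKWindowJoin:
    """Brute-force baseline: recompute the top-k pairs over a count-based window."""

    def __init__(self, k: int, window_size: int):
        self.k = k
        # deque with maxlen evicts the oldest set automatically (assumed window model).
        self.window = deque(maxlen=window_size)
        self.next_id = 0

    def insert(self, tokens):
        """Add a set to the window and return the current top-k pairs."""
        self.window.append((self.next_id, frozenset(tokens)))
        self.next_id += 1
        return self.top_k()

    def top_k(self):
        """Return up to k (similarity, id_a, id_b) triples, most similar first."""
        pairs = [
            (jaccard(s, t), i, j)
            for (i, s), (j, t) in combinations(self.window, 2)
        ]
        # Highest similarity first; ties broken by set ids for determinism.
        pairs.sort(key=lambda p: (-p[0], p[1], p[2]))
        return pairs[: self.k]


if __name__ == "__main__":
    join = NaiveTopKWindowJoin(k=1, window_size=2)
    print(join.insert({"a", "b"}))       # one set only, so no pairs yet
    print(join.insert({"a", "b", "c"}))  # sets 0 and 1 form the top-1 pair
    print(join.insert({"x"}))            # set 0 expires; a less similar pair takes over
```

In the last step, the expiry of set 0 forces the top-1 result to fall back to a pair that was not among the most similar before, which is the situation the abstract describes. The quadratic recomputation here is exactly the cost that an incremental algorithm would aim to avoid.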


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/084ddaf2aa36/778_2024_880_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/3d189621a99f/778_2024_880_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/9e7f124c8417/778_2024_880_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/c988fed5dcdb/778_2024_880_Figa_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/3e79a27a110b/778_2024_880_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/fa6bfef65c45/778_2024_880_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/eb414bae6213/778_2024_880_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/7f16f95bce5c/778_2024_880_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/ae3652d006df/778_2024_880_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/f6bd1debb32f/778_2024_880_Fig9_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/e7bcd1186b26/778_2024_880_Fig10_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/ec5344fa1c42/778_2024_880_Figc_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/b2579936b385/778_2024_880_Fig11_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/9882d0fee482/778_2024_880_Fig12_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/a52226960198/778_2024_880_Figd_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/0b1cc2cac84a/778_2024_880_Fig13_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/8958ceef40fe/778_2024_880_Fig14_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/f44b912ea327/778_2024_880_Fig15_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/1d7da06cb085/778_2024_880_Fig16_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/08be4dd053a0/778_2024_880_Fig17_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/ae32092adbe4/778_2024_880_Fig18_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/4154498cc940/778_2024_880_Fig19_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/589b7b058962/778_2024_880_Fig20_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/31fad19b4920/778_2024_880_Fig21_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a352/11666680/32650c3f992f/778_2024_880_Fig22_HTML.jpg

Similar articles

1
SWOOP: top-k similarity joins over set streams.
VLDB J. 2025;34(1):13. doi: 10.1007/s00778-024-00880-x. Epub 2024 Dec 23.
2
Fair Max-Min Diversity Maximization in Streaming and Sliding-Window Models.
Entropy (Basel). 2023 Jul 14;25(7):1066. doi: 10.3390/e25071066.
3
A hashtag recommendation system for twitter data streams.
Comput Soc Netw. 2016;3(1):3. doi: 10.1186/s40649-016-0028-9. Epub 2016 May 31.
4
Efficient and scalable graph similarity joins in MapReduce.
ScientificWorldJournal. 2014;2014:749028. doi: 10.1155/2014/749028. Epub 2014 Jul 8.
5
Fast Inbound Top-K Query for Random Walk with Restart.
Mach Learn Knowl Discov Databases. 2015 Sep;9285:608-624. doi: 10.1007/978-3-319-23525-7_37. Epub 2015 Aug 29.
6
Beyond Equi-joins: Ranking, Enumeration and Factorization.
Proceedings VLDB Endowment. 2021 Jul;14(11):2599-2612. doi: 10.14778/3476249.3476306. Epub 2021 Oct 27.
7
Ant Colony Stream Clustering: A Fast Density Clustering Algorithm for Dynamic Data Streams.
IEEE Trans Cybern. 2019 Jun;49(6):2215-2228. doi: 10.1109/TCYB.2018.2822552. Epub 2018 May 10.
8
A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation.
JMIR Med Inform. 2021 Jun 24;9(6):e29667. doi: 10.2196/29667.
9
Designing a Streaming Algorithm for Outlier Detection in Data Mining-An Incrementa Approach.
Sensors (Basel). 2020 Feb 26;20(5):1261. doi: 10.3390/s20051261.
10
Visual Structural Assessment and Anomaly Detection for High-Velocity Data Streams.
IEEE Trans Cybern. 2021 Dec;51(12):5979-5992. doi: 10.1109/TCYB.2020.2973137. Epub 2021 Dec 22.
