Wang Yanhao, Fabbri Francesco, Mathioudakis Michael, Li Jia
School of Data Science and Engineering, East China Normal University, Shanghai 200062, China.
Spotify, 08000 Barcelona, Spain.
Entropy (Basel). 2023 Jul 14;25(7):1066. doi: 10.3390/e25071066.
Diversity maximization is a fundamental problem with broad applications in data summarization, web search, and recommender systems. Given a set of elements, the problem asks for a subset of k≪n elements with maximum diversity, as quantified by the dissimilarities among the elements in . In this paper, we study diversity maximization with fairness constraints in streaming and sliding-window models. Specifically, we focus on the max-min diversity maximization problem, which selects a subset that maximizes the minimum distance (dissimilarity) between any pair of distinct elements within it. Assuming that the set is partitioned into disjoint groups by a specific sensitive attribute, e.g., sex or race, ensuring fairness requires that the selected subset contains ki elements from each group i∈[m]. Although diversity maximization has been extensively studied, existing algorithms for fair max-min diversity maximization are inefficient for data streams. To address the problem, we first design efficient approximation algorithms for this problem in the (insert-only) streaming model, where data arrive one element at a time, and a solution should be computed based on the elements observed in one pass. Furthermore, we propose approximation algorithms for this problem in the sliding-window model, where only the latest elements in the stream are considered for computation to capture the recency of the data. Experimental results on real-world and synthetic datasets show that our algorithms provide solutions of comparable quality to the state-of-the-art offline algorithms while running several orders of magnitude faster in the streaming and sliding-window settings.
多样性最大化是一个在数据汇总、网络搜索和推荐系统等领域有广泛应用的基本问题。给定一组元素,该问题要求找出一个包含k≪n个元素的子集,其具有最大的多样性,这是通过子集中元素之间的差异来量化的。在本文中,我们研究了流模型和滑动窗口模型中带有公平性约束的多样性最大化问题。具体来说,我们关注最大最小多样性最大化问题,即选择一个子集,使得子集中任意两个不同元素之间的最小距离(差异)最大化。假设集合由一个特定的敏感属性(例如性别或种族)划分为m个不相交的组,确保公平性要求所选子集在每个组i∈[m]中包含ki个元素。尽管多样性最大化已经得到了广泛研究,但现有的公平最大最小多样性最大化算法对于数据流来说效率低下。为了解决这个问题,我们首先在(仅插入)流模型中为该问题设计了高效的近似算法,在这种模型中数据一次到达一个元素,并且应该基于一次遍历中观察到的元素来计算解决方案。此外,我们还为滑动窗口模型中的这个问题提出了近似算法,在该模型中,仅考虑流中最新的元素进行计算,以捕捉数据的时效性。在真实世界和合成数据集上的实验结果表明,我们的算法在流模型和滑动窗口设置下运行速度比最先进的离线算法快几个数量级的同时,还能提供质量相当的解决方案。