HT-Fed-GAN: Federated Generative Model for Decentralized Tabular Data Synthesis.

Authors

Duan Shaoming, Liu Chuanyi, Han Peiyi, Jin Xiaopeng, Zhang Xinyi, He Tianyu, Pan Hezhong, Xiang Xiayu

Affiliations

School of Computer Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China.

Institute of Data Security, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China.

Publication information

Entropy (Basel). 2022 Dec 31;25(1):88. doi: 10.3390/e25010088.

Abstract

In this paper, we study the problem of privacy-preserving data synthesis (PPDS) for tabular data in a distributed multi-party environment. Existing methods for PPDS in a decentralized setting use federated generative models with differential privacy. Unfortunately, these models apply only to image or text data, not to tabular data. Unlike images, tabular data usually consist of mixed data types (discrete and continuous attributes), and real-world tables often have highly imbalanced data distributions. Existing methods can hardly model such scenarios because the decentralized continuous columns follow multimodal distributions and the clients' categorical attributes are highly imbalanced. To solve these problems, we propose HT-Fed-GAN, a federated generative model for decentralized tabular data synthesis. HT-Fed-GAN has three main components: a federated variational Bayesian Gaussian mixture model (Fed-VB-GMM), designed to handle multimodal distributions; federated conditional one-hot encoding with conditional sampling, for global categorical attribute representation and rebalancing; and a privacy consumption-based federated conditional GAN, for privacy-preserving decentralized data modeling. Experimental results on five real-world datasets show that HT-Fed-GAN obtains the best trade-off between data utility and privacy level. In terms of data utility, the tables generated by HT-Fed-GAN are the most statistically similar to the original tables, and the evaluation scores show that HT-Fed-GAN outperforms the state-of-the-art model on machine learning tasks.
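The categorical rebalancing component named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the sampling rule, so the log-frequency weighting below is an assumption borrowed from conditional tabular GANs such as CTGAN, and all function names and the toy data are hypothetical. The sketch also shares raw category counts with the server and ignores differential privacy, which a real system would have to handle.

```python
import math
import random
from collections import Counter


def aggregate_counts(client_columns):
    """Sum per-client category counts into one global Counter."""
    total = Counter()
    for column in client_columns:
        total.update(Counter(column))
    return total


def global_one_hot_index(global_counts):
    """Give every category seen on any client a fixed one-hot position."""
    return {cat: i for i, cat in enumerate(sorted(global_counts))}


def log_frequency_probs(global_counts):
    """Assumed rebalancing rule: weight each category by log(1 + count)."""
    weights = {cat: math.log1p(n) for cat, n in global_counts.items()}
    norm = sum(weights.values())
    return {cat: w / norm for cat, w in weights.items()}


def sample_condition(probs, index, rng):
    """Draw one category and return it with its global one-hot vector."""
    cats = sorted(probs)
    chosen = rng.choices(cats, weights=[probs[c] for c in cats], k=1)[0]
    vec = [0] * len(index)
    vec[index[chosen]] = 1
    return chosen, vec


# Hypothetical toy column held by two clients; "no" is rare, and each
# client is missing one category that the other client has.
client_a = ["yes"] * 95 + ["no"] * 5
client_b = ["yes"] * 80 + ["maybe"] * 20

counts = aggregate_counts([client_a, client_b])
index = global_one_hot_index(counts)
probs = log_frequency_probs(counts)
chosen, cond = sample_condition(probs, index, random.Random(0))
```

In this toy column the rare category "no" makes up 5 of 200 rows, but under the log-frequency weights it is drawn as a condition far more often than that, so a conditional generator would see minority categories regularly during training. Building the one-hot index from the union of all clients' categories keeps the encoding consistent across clients.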

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/580d/9858387/d91261b6155d/entropy-25-00088-g001.jpg