Deterministic Autoencoder using Wasserstein loss for tabular data generation.

Authors

Wang Alex X, Nguyen Binh P

Affiliations

School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6012, New Zealand.

School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6012, New Zealand; Faculty of Information Technology, Ho Chi Minh City Open University, 97 Vo Van Tan, District 3, Ho Chi Minh City 70000, Viet Nam.

Publication information

Neural Netw. 2025 May;185:107208. doi: 10.1016/j.neunet.2025.107208. Epub 2025 Jan 29.

DOI:10.1016/j.neunet.2025.107208

PMID:39893805

Abstract

Tabular data generation is a complex task due to its distinctive characteristics and inherent complexities. While Variational Autoencoders have been adapted from the computer vision domain for tabular data synthesis, their reliance on non-deterministic latent space regularization introduces limitations. The stochastic nature of Variational Autoencoders can contribute to collapsed posteriors, yielding suboptimal outcomes and limiting control over the latent space. This characteristic also constrains the exploration of latent space interpolation. To address these challenges, we present the Tabular Wasserstein Autoencoder (TWAE), leveraging the deterministic encoding mechanism of Wasserstein Autoencoders. This characteristic facilitates a deterministic mapping of inputs to latent codes, enhancing the stability and expressiveness of our model's latent space. This, in turn, enables seamless integration with shallow interpolation mechanisms like the synthetic minority over-sampling technique (SMOTE) within the data generation process via deep learning. Specifically, TWAE is trained once to establish a low-dimensional representation of real data, and various latent interpolation methods efficiently generate synthetic latent points, achieving a balance between accuracy and efficiency. Extensive experiments consistently demonstrate TWAE's superiority, showcasing its versatility across diverse feature types and dataset sizes. This innovative approach, combining WAE principles with shallow interpolation, effectively leverages SMOTE's advantages, establishing TWAE as a robust solution for complex tabular data synthesis.
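
The latent interpolation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `smote_latent`, its parameters, and the random array standing in for encoder outputs are all assumptions made for illustration. The sketch only shows the SMOTE-style idea of drawing a new point on the segment between a latent code and one of its nearest neighbours.

```python
import numpy as np


def smote_latent(Z, n_new, k=5, seed=0):
    """SMOTE-style interpolation over latent codes (hypothetical helper).

    Z     : (n, d) array of latent codes, assumed to come from a
            deterministic encoder that has already been trained.
    n_new : number of synthetic latent points to generate.
    k     : number of nearest neighbours considered per point.
    """
    rng = np.random.default_rng(seed)
    Z = np.asarray(Z, dtype=float)
    n = Z.shape[0]
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise squared Euclidean distances; exclude self-matches.
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    neighbours = np.argsort(d2, axis=1)[:, :k]  # shape (n, k)

    # Pick a base point and one of its k neighbours for each new sample.
    base = rng.integers(0, n, size=n_new)
    pick = rng.integers(0, k, size=n_new)
    partner = neighbours[base, pick]

    # Interpolate at a random position along the connecting segment.
    lam = rng.random((n_new, 1))
    return Z[base] + lam * (Z[partner] - Z[base])


# Stand-in for encoder outputs (assumption: 100 rows, 8 latent dimensions).
Z_latent = np.random.default_rng(42).normal(size=(100, 8))
Z_new = smote_latent(Z_latent, n_new=50, k=5)
print(Z_new.shape)
```

In the pipeline the abstract outlines, the synthetic latent points would then be passed through the trained decoder to obtain synthetic table rows; that step is omitted here because it depends on the model itself.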
