Borisov Vadim, Leemann Tobias, Seßler Kathrin, Haug Johannes, Pawelczyk Martin, Kasneci Gjergji
IEEE Trans Neural Netw Learn Syst. 2024 Jun;35(6):7499-7519. doi: 10.1109/TNNLS.2022.3229161. Epub 2024 Jun 3.
Heterogeneous tabular data are the most commonly used form of data and are essential for numerous critical and computationally demanding applications. On homogeneous datasets, deep neural networks have repeatedly shown excellent performance and have therefore been widely adopted. However, their adaptation to tabular data for inference or data generation tasks remains highly challenging. To facilitate further progress in the field, this work provides an overview of state-of-the-art deep learning methods for tabular data. We categorize these methods into three groups: data transformations, specialized architectures, and regularization models. For each of these groups, our work offers a comprehensive overview of the main approaches. Moreover, we discuss deep learning approaches for generating tabular data and also provide an overview of strategies for explaining deep models on tabular data. Thus, our first contribution is to address the main research streams and existing methodologies in the mentioned areas while highlighting relevant challenges and open research questions. Our second contribution is to provide an empirical comparison of traditional machine learning methods with 11 deep learning approaches across five popular real-world tabular datasets of different sizes and with different learning objectives. Our results, which we have made publicly available as competitive benchmarks, indicate that algorithms based on gradient-boosted tree ensembles still mostly outperform deep learning models on supervised learning tasks, suggesting that the research progress on competitive deep learning models for tabular data is stagnating. To the best of our knowledge, this is the first in-depth overview of deep learning approaches for tabular data; as such, this work can serve as a valuable starting point to guide researchers and practitioners interested in deep learning with tabular data.