The Institute of Mathematical Sciences, Chennai, India.
Homi Bhabha National Institute, Mumbai, India.
PLoS One. 2024 Apr 17;19(4):e0302271. doi: 10.1371/journal.pone.0302271. eCollection 2024.
We provide new algorithms for two tasks relating to heterogeneous tabular datasets: clustering and synthetic data generation. Tabular datasets typically consist of heterogeneous data types (numerical, ordinal, categorical) in columns, but may also have hidden cluster structure in their rows: for example, they may be drawn from heterogeneous (geographical, socioeconomic, methodological) sources, such that the outcome variable they describe (such as the presence of a disease) may depend not only on the other variables but also on the cluster context. Moreover, sharing of biomedical data is often hindered by patient confidentiality laws, and there is current interest in algorithms that generate synthetic tabular data from real data, for example via deep learning. We demonstrate a novel EM-based clustering algorithm, MMM ("Madras Mixture Model"), that outperforms standard algorithms in determining clusters in synthetic heterogeneous data, and recovers structure in real data. Based on this, we demonstrate a synthetic tabular data generation algorithm, MMMsynth, that pre-clusters the input data and generates synthetic data cluster by cluster, assuming cluster-specific data distributions for the input columns. We benchmark this algorithm by testing the performance of standard ML algorithms when they are trained on synthetic data and tested on real published datasets. Our synthetic data generation algorithm outperforms other tabular-data generators from the literature, and approaches the performance of training purely with real data.
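The idea of cluster-wise generation can be illustrated with a minimal sketch. This is not the MMMsynth implementation: it assumes cluster labels have already been assigned by some clustering step, models each numerical column in a cluster as an independent Gaussian, and models each categorical column by its empirical frequencies. The function names `fit_clusterwise` and `sample_synthetic`, the column names, and the toy data are all hypothetical.

```python
import random
import statistics
from collections import Counter, defaultdict


def fit_clusterwise(rows, labels, numeric_cols, categorical_cols):
    """Fit independent per-column distributions inside each cluster.

    Assumption for illustration only: Gaussian for numerical columns,
    empirical frequencies for categorical columns.
    """
    by_cluster = defaultdict(list)
    for row, label in zip(rows, labels):
        by_cluster[label].append(row)

    model = {}
    for label, members in by_cluster.items():
        params = {"weight": len(members) / len(rows), "num": {}, "cat": {}}
        for col in numeric_cols:
            values = [r[col] for r in members]
            # Population standard deviation is 0.0 for a single-row cluster.
            params["num"][col] = (statistics.fmean(values), statistics.pstdev(values))
        for col in categorical_cols:
            counts = Counter(r[col] for r in members)
            params["cat"][col] = (list(counts.keys()), list(counts.values()))
        model[label] = params
    return model


def sample_synthetic(model, n, seed=0):
    """Draw n synthetic rows: pick a cluster, then sample each column from it."""
    rng = random.Random(seed)
    cluster_ids = list(model)
    weights = [model[c]["weight"] for c in cluster_ids]
    synthetic = []
    for _ in range(n):
        cluster = rng.choices(cluster_ids, weights=weights)[0]
        params = model[cluster]
        row = {"cluster": cluster}
        for col, (mean, std) in params["num"].items():
            row[col] = rng.gauss(mean, std)
        for col, (values, counts) in params["cat"].items():
            row[col] = rng.choices(values, weights=counts)[0]
        synthetic.append(row)
    return synthetic


# Toy "real" data with two hypothetical clusters (labels assumed given).
real_rows = [
    {"age": 30.0, "smoker": "no"},
    {"age": 34.0, "smoker": "no"},
    {"age": 61.0, "smoker": "yes"},
    {"age": 65.0, "smoker": "yes"},
]
labels = [0, 0, 1, 1]

model = fit_clusterwise(real_rows, labels, ["age"], ["smoker"])
synthetic_rows = sample_synthetic(model, 200, seed=1)
```

Because every column is sampled conditionally on the cluster, dependencies that arise only through cluster membership (here, age and smoking status) are preserved in the synthetic rows, even though columns are treated as independent within a cluster.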