MMM and MMMSynth: Clustering of heterogeneous tabular data, and synthetic data generation.

Affiliations

The Institute of Mathematical Sciences, Chennai, India.

Homi Bhabha National Institute, Mumbai, India.

Publication information

PLoS One. 2024 Apr 17;19(4):e0302271. doi: 10.1371/journal.pone.0302271. eCollection 2024.

DOI: 10.1371/journal.pone.0302271
PMID: 38630664
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11023594/
Abstract

We provide new algorithms for two tasks relating to heterogeneous tabular datasets: clustering, and synthetic data generation. Tabular datasets typically consist of heterogeneous data types (numerical, ordinal, categorical) in columns, but may also have hidden cluster structure in their rows: for example, they may be drawn from heterogeneous (geographical, socioeconomic, methodological) sources, such that the outcome variable they describe (such as the presence of a disease) may depend not only on the other variables but on the cluster context. Moreover, sharing of biomedical data is often hindered by patient confidentiality laws, and there is current interest in algorithms to generate synthetic tabular data from real data, for example via deep learning. We demonstrate a novel EM-based clustering algorithm, MMM ("Madras Mixture Model"), that outperforms standard algorithms in determining clusters in synthetic heterogeneous data, and recovers structure in real data. Based on this, we demonstrate a synthetic tabular data generation algorithm, MMMSynth, that pre-clusters the input data, and generates cluster-wise synthetic data assuming cluster-specific data distributions for the input columns. We benchmark this algorithm by testing the performance of standard ML algorithms when they are trained on synthetic data and tested on real published datasets. Our synthetic data generation algorithm outperforms other tabular-data generators from the literature, and approaches the performance of training purely with real data.
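The cluster-wise generation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' MMMSynth implementation: it assumes cluster labels are already available from some earlier clustering step, models each numeric column in a cluster as an independent Gaussian, and samples each categorical column from its empirical frequencies. The function names (`fit_cluster_models`, `sample_synthetic`), the column names, and the toy data are all hypothetical.

```python
import random
import statistics
from collections import Counter, defaultdict


def fit_cluster_models(rows, labels, numeric_cols, categorical_cols):
    """Fit simple per-cluster, per-column distributions.

    Numeric columns get a (mean, stdev) pair; categorical columns get
    empirical value counts. Both choices are illustrative assumptions,
    not the distributions used in the paper.
    """
    by_cluster = defaultdict(list)
    for row, label in zip(rows, labels):
        by_cluster[label].append(row)

    models = {}
    for label, members in by_cluster.items():
        model = {"size": len(members), "numeric": {}, "categorical": {}}
        for col in numeric_cols:
            values = [r[col] for r in members]
            # pstdev is defined even for a single-row cluster (it gives 0.0).
            model["numeric"][col] = (statistics.fmean(values),
                                     statistics.pstdev(values))
        for col in categorical_cols:
            model["categorical"][col] = Counter(r[col] for r in members)
        models[label] = model
    return models


def sample_synthetic(models, n_rows, seed=0):
    """Draw synthetic rows: pick a cluster by size, then sample each column."""
    rng = random.Random(seed)
    cluster_ids = list(models)
    weights = [models[c]["size"] for c in cluster_ids]
    synthetic = []
    for _ in range(n_rows):
        cluster = rng.choices(cluster_ids, weights=weights, k=1)[0]
        model = models[cluster]
        row = {"cluster": cluster}
        for col, (mean, sd) in model["numeric"].items():
            row[col] = rng.gauss(mean, sd)
        for col, counts in model["categorical"].items():
            values = list(counts)
            row[col] = rng.choices(
                values, weights=[counts[v] for v in values], k=1
            )[0]
        synthetic.append(row)
    return synthetic


# Toy input: two hand-labelled clusters (hypothetical data).
rows = [
    {"age": 30.0, "smoker": "no"},
    {"age": 34.0, "smoker": "no"},
    {"age": 61.0, "smoker": "yes"},
    {"age": 67.0, "smoker": "yes"},
]
labels = [0, 0, 1, 1]

models = fit_cluster_models(rows, labels, ["age"], ["smoker"])
synthetic = sample_synthetic(models, n_rows=6, seed=1)
```

Because every column is sampled from a distribution fitted within a single cluster, dependencies that arise from cluster membership (here, older rows being smokers) carry over to the synthetic rows, even though columns are treated as independent inside each cluster.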

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2e57/11023594/d7522e02a836/pone.0302271.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2e57/11023594/c3aa7e3da46f/pone.0302271.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2e57/11023594/38647c329f78/pone.0302271.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2e57/11023594/9aa8e05c5210/pone.0302271.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2e57/11023594/e6843453de8c/pone.0302271.g005.jpg
