Actionability of Synthetic Data in a Heterogeneous and Rare Health Care Demographic: Adolescents and Young Adults With Cancer.

Authors

Hogenboom Joshi, Lobo Gomes Aiara, Dekker Andre, Van Der Graaf Winette, Husson Olga, Wee Leonard

Affiliations

Department of Radiation Oncology (Maastro), GROW School for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, the Netherlands.

Department of Medical Oncology, Netherlands Cancer Institute, Amsterdam, the Netherlands.

Publication information

JCO Clin Cancer Inform. 2024 Dec;8:e2400056. doi: 10.1200/CCI.24.00056. Epub 2024 Dec 3.

DOI:10.1200/CCI.24.00056
PMID:39626135
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11627331/
Abstract

PURPOSE

Research on rare diseases and atypical health care demographics is often slowed by high interparticipant heterogeneity and overall scarcity of data. Synthetic data (SD) have been proposed as a means of data sharing, enlargement, and diversification, by artificially generating real phenomena while obscuring the real patient data. The utility of SD is actively scrutinized in health care research, but the role of sample size in the actionability of SD is insufficiently explored. We aim to understand the interplay of actionability and sample size by generating SD sets of varying sizes from gradually diminishing amounts of real individuals' data. We evaluate the actionability of SD in a highly heterogeneous and rare demographic: adolescents and young adults (AYAs) with cancer.

METHODS

A population-based cross-sectional cohort study of 3,735 AYAs was subsampled at random to produce 13 training data sets of varying sample sizes. We studied four distinct generator architectures built on the open-source Synthetic Data Vault library. Each architecture was used to generate SD of varying sizes on the basis of each of the aforementioned training subsets. SD actionability was assessed by comparing the resulting SD with their respective real data against three metrics: veracity, utility, and privacy concealment.
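The experimental design described above can be sketched as a grid of (training subset, generator, synthetic sample size) combinations. The following is a minimal illustration using only the Python standard library, not the authors' code. The cohort size (3,735) and the number of training sets (13) come from the abstract; the evenly spaced training sizes, the independent (non-nested) subsampling, the generator labels, and the synthetic-size multipliers are all assumptions made for illustration, since the abstract does not state them.

```python
import random
from itertools import product

N_REAL = 3735     # cohort size reported in the abstract
N_SUBSETS = 13    # number of training data sets reported in the abstract

# Hypothetical placeholders: the abstract says four architectures were
# studied but does not name them here.
GENERATORS = ["generator_a", "generator_b", "generator_c", "generator_d"]

# Hypothetical multipliers for the synthetic sample size relative to the
# training size (smaller than, equal to, and larger than the real data).
SD_MULTIPLIERS = [0.5, 1.0, 2.0]


def training_sizes(n_real=N_REAL, n_subsets=N_SUBSETS):
    """Return gradually diminishing training sizes.

    Assumption: evenly spaced sizes from the full cohort downwards.
    """
    step = n_real / n_subsets
    return [round(step * k) for k in range(n_subsets, 0, -1)]


def build_grid(seed=0):
    """Cross every random training subset with every generator and SD size."""
    rng = random.Random(seed)
    participant_ids = list(range(N_REAL))
    grid = []
    for n_train in training_sizes():
        # Random subsample without replacement, drawn independently per size.
        train_ids = rng.sample(participant_ids, n_train)
        for generator, multiplier in product(GENERATORS, SD_MULTIPLIERS):
            grid.append({
                "generator": generator,
                "n_train": n_train,
                "n_synthetic": max(1, round(n_train * multiplier)),
                "train_ids": train_ids,
            })
    return grid


if __name__ == "__main__":
    grid = build_grid()
    print(len(grid), "experimental conditions")
    print("training sizes:", training_sizes())
```

In a real pipeline, each grid entry would be passed to a generator fitted on the rows in `train_ids`, and the sampled output would then be scored on veracity, utility, and privacy concealment against the corresponding real subset.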

RESULTS

All examined generator architectures yielded actionable data when generating SD with sizes similar to the real data. Larger SD sample sizes increased veracity but generally also increased privacy risks. Using fewer training participants led to faster convergence in veracity, but partially exacerbated privacy concealment issues.

CONCLUSION

SD is a potentially promising option for data sharing and data augmentation, yet sample size plays a significant role in its actionability. SD generation should go hand-in-hand with consistent scrutiny, and sample size should be carefully considered in this process.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5f20/11627331/18d96f4b05a6/cci-8-e2400056-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5f20/11627331/599c027c0c10/cci-8-e2400056-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5f20/11627331/e115d3a5ec73/cci-8-e2400056-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5f20/11627331/a9900b559fc5/cci-8-e2400056-g003.jpg
