Synthetic data generation for a longitudinal cohort study - evaluation, method extension and reproduction of published data analysis results.

Affiliations

Knowledge Management, ZB MED - Information Centre for Life Sciences, 50931, Cologne, Germany.

Faculty of Technology, Graduate School DILS, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University, 33615, Bielefeld, Germany.

Publication information

Sci Rep. 2024 Jun 22;14(1):14412. doi: 10.1038/s41598-024-62102-2.

Abstract

Access to individual-level health data is essential for gaining new insights and advancing science. In particular, modern methods based on artificial intelligence rely on the availability of and access to large datasets. In the health sector, access to individual-level data is often challenging due to privacy concerns. A promising alternative is the generation of fully synthetic data, i.e., data generated through a randomised process that have statistical properties similar to those of the original data, but do not have a one-to-one correspondence with the original individual-level records. In this study, we use a state-of-the-art synthetic data generation method and perform in-depth quality analyses of the generated data for a specific use case in the field of nutrition. We demonstrate the need for careful analyses of synthetic data that go beyond descriptive statistics and provide valuable insights into how to realise the full potential of synthetic datasets. By extending the methods and by thoroughly analysing the effects of sampling from a trained model, we are able to largely reproduce significant real-world analysis results in the chosen use case.
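The point that synthetic data need checks beyond descriptive statistics can be illustrated with a toy sketch. It is not the method or the data used in the study: the two variables, the two generators (`synth_independent`, `synth_gaussian`) and the `slope` helper are all hypothetical, chosen only to show that a generator can match means and standard deviations while failing to reproduce a downstream regression result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "original" cohort with two correlated variables (toy data, not from the study).
n = 5000
intake = rng.normal(loc=2000.0, scale=300.0, size=n)
outcome = 0.01 * intake + rng.normal(loc=0.0, scale=2.0, size=n)
original = np.column_stack([intake, outcome])


def synth_independent(data, rng):
    """Naive generator: resample each column separately.

    Marginal distributions are preserved, dependence between columns is lost.
    """
    cols = [rng.choice(data[:, j], size=len(data), replace=True)
            for j in range(data.shape[1])]
    return np.column_stack(cols)


def synth_gaussian(data, rng):
    """Joint generator: sample from a multivariate normal fitted to the data."""
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=len(data))


def slope(data):
    """Least-squares slope of column 1 regressed on column 0."""
    return np.polyfit(data[:, 0], data[:, 1], deg=1)[0]


naive = synth_independent(original, rng)
joint = synth_gaussian(original, rng)

# Descriptive statistics look alike for all three datasets; the slope does not.
for name, d in [("original", original), ("naive", naive), ("joint", joint)]:
    print(name, d.mean(axis=0).round(2), d.std(axis=0).round(2), round(slope(d), 4))
```

In this sketch both synthetic datasets pass a comparison of means and standard deviations, but only the jointly sampled one recovers a regression slope close to that of the original data. Rerunning with different seeds also gives a rough sense of how much an analysis result can vary from one synthetic sample to the next.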

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/87a7/11193715/17c32bfd5e94/41598_2024_62102_Fig1_HTML.jpg
