Empirically-derived synthetic populations to mitigate small sample sizes.

Authors

Fowler Erin E, Berglund Anders, Schell Michael J, Sellers Thomas A, Eschrich Steven, Heine John

Affiliations

Cancer Epidemiology Department, MCC, Moffitt Cancer Center & Research Institute, 12901 Bruce B. Downs Blvd, Tampa, FL 33612, United States.

Department of Biostatistics and Bioinformatics, MCC, Moffitt Cancer Center & Research Institute, 12901 Bruce B. Downs Blvd, Tampa, FL 33612, United States.

Publication information

J Biomed Inform. 2020 May;105:103408. doi: 10.1016/j.jbi.2020.103408. Epub 2020 Mar 12.

DOI:10.1016/j.jbi.2020.103408

PMID:32173502

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7839232/

Abstract

Limited sample sizes can lead to spurious modeling findings in biomedical research. The objective of this work is to present a new method to generate synthetic populations (SPs) from limited samples using matched case-control data (n = 180 pairs), considered as two separate limited samples. SPs were generated with multivariate kernel density estimations (KDEs) with unconstrained bandwidth matrices. We included four continuous variables and one categorical variable for each individual. Bandwidth matrices were determined with Differential Evolution (DE) optimization by covariance comparisons. Four synthetic samples (n = 180) were derived from their respective SPs. Similarity between observed samples with synthetic samples was compared assuming their empirical probability density functions (EPDFs) were similar. EPDFs were compared with the maximum mean discrepancy (MMD) test statistic based on the Kernel Two-Sample Test. To evaluate similarity within a modeling context, EPDFs derived from the Principal Component Analysis (PCA) scores and residuals were summarized with the distance to the model in X-space (DModX) as additional comparisons. Four SPs were generated from each sample. The probability of selecting a replicate when randomly constructing synthetic samples (n = 180) was infinitesimally small. MMD tests indicated that the observed sample EPDFs were similar to the respective synthetic EPDFs. For the samples, PCA scores and residuals did not deviate significantly when compared with their respective synthetic samples. The feasibility of this approach was demonstrated by producing synthetic data at the individual level, statistically similar to the observed samples. The methodology coupled KDE with DE optimization and deployed novel similarity metrics derived from PCA. This approach could be used to generate larger-sized synthetic samples. 
To develop this approach into a research tool for data exploration purposes, additional evaluation with increased dimensionality is required. Moreover, given a fully specified population, the degree to which individuals can be discarded while synthesizing the respective population accurately will be investigated. When these objectives are addressed, comparisons with other techniques such as bootstrapping will be required for a complete evaluation.
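
The generation and comparison steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions. The function names are hypothetical. The full bandwidth matrix comes from Silverman's rule of thumb rather than from Differential Evolution with covariance comparisons. The kernel width for the MMD statistic uses the median heuristic. The p-value comes from a permutation test. The categorical variable and the PCA/DModX comparison are omitted. The data are random placeholders, not the study data.

```python
import numpy as np


def silverman_bandwidth_matrix(X):
    """Full (unconstrained) bandwidth matrix H = h^2 * sample covariance.

    ASSUMPTION: Silverman's rule of thumb stands in for the paper's
    Differential Evolution search, which is not reproduced here.
    """
    n, d = X.shape
    h = (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))
    return h ** 2 * np.cov(X, rowvar=False)


def sample_gaussian_kde(X, H, n_draws, rng):
    """Draw from a Gaussian KDE: resample observed rows, then add N(0, H) noise."""
    idx = rng.integers(0, X.shape[0], size=n_draws)
    noise = rng.multivariate_normal(np.zeros(X.shape[1]), H, size=n_draws)
    return X[idx] + noise


def _sq_dists(A, B):
    """Pairwise squared Euclidean distances, clipped at zero for round-off."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.maximum(sq, 0.0)


def median_heuristic(Z):
    """Median pairwise distance, a common default Gaussian kernel width."""
    d = np.sqrt(_sq_dists(Z, Z))
    return np.median(d[np.triu_indices_from(d, k=1)])


def mmd2_biased(X, Y, sigma):
    """Biased estimate of squared MMD with a Gaussian kernel of width sigma."""
    def k(A, B):
        return np.exp(-_sq_dists(A, B) / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()


def mmd_permutation_pvalue(X, Y, sigma, n_perm, rng):
    """Permutation p-value for H0: X and Y come from the same distribution."""
    observed = mmd2_biased(X, Y, sigma)
    Z = np.vstack([X, Y])
    n = X.shape[0]
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(Z.shape[0])
        if mmd2_biased(Z[perm[:n]], Z[perm[n:]], sigma) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder "observed" sample: 180 individuals, 4 continuous variables.
    cov = 0.5 * np.eye(4) + 0.5
    observed = rng.multivariate_normal(np.zeros(4), cov, size=180)

    H = silverman_bandwidth_matrix(observed)
    synthetic = sample_gaussian_kde(observed, H, n_draws=180, rng=rng)

    sigma = median_heuristic(np.vstack([observed, synthetic]))
    stat, p = mmd_permutation_pvalue(observed, synthetic, sigma, n_perm=200, rng=rng)
    print(f"MMD^2 = {stat:.5f}, permutation p-value = {p:.3f}")
```

In this sketch a large p-value means the test found no detectable difference between the observed and synthetic samples. Because each draw adds N(0, H) noise to a resampled row, the synthetic covariance is approximately the sample covariance plus H, so the choice of H directly controls how closely the synthetic covariance tracks the observed one.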
