Warmenhoven John, Impellizzeri Franco M, Shrier Ian, Vigotsky Andrew D, Lolli Lorenzo, Menaspà Paolo, Coutts Aaron J, Fanchini Maurizio, Hooker Giles
School of Sport, Exercise and Rehabilitation and Human Performance Research Centre, University of Technology Sydney (UTS), Sydney, Australia.
Australian Institute of Sport, Australian Sports Commission, Canberra, Australia.
Sports Med. 2025 Jun 26. doi: 10.1007/s40279-025-02221-6.
Synthetic data represent alternative data sources generated using mathematical procedures to address specific issues in research and practice. Synthetic data have emerging applications in clinical and medical data contexts and may assist in overcoming privacy issues to help support open science practice.
The present study discusses the applicability of an established synthetic data generation process using sequential tree-based algorithms (synthpop package in R) to athlete monitoring data in sport. The aim is to provide an educational primer and discussion for potential application of these methods when exploring issues in the field of sport and exercise sciences.
The R package synthpop was applied to a professional football dataset under seven simulation conditions with varying model constraints. Classification and regression trees were used as the base model framework for each simulation. Metrics associated with both global utility (overall dataset similarity) and specific utility (specific research outcome similarity) were assessed for each simulation condition.
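The sequential tree-based idea described above can be illustrated with a minimal sketch. It is written in plain Python rather than R, it does not call synthpop, and it is not the authors' pipeline. It assumes only two numeric variables, draws the first by resampling observed values, and draws the second from the observed "donor" values in the matching leaf of a small single-predictor regression tree. All names (`grow`, `donors`, `synthesise`, `slope`) and the toy variables (a session distance and a perceived exertion score) are hypothetical, and the simulated numbers are not taken from the study. Comparing a regression slope between the original and synthetic data stands in for a specific utility check.

```python
import random


def sse(vals):
    """Sum of squared deviations from the mean."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)


def grow(pairs, min_leaf=5):
    """Grow a one-predictor regression tree from (x, y) pairs sorted by x."""
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        cost = sse([y for _, y in pairs[:i]]) + sse([y for _, y in pairs[i:]])
        if best is None or cost < best[0]:
            best = (cost, i)
    # Stop when no valid split exists or no split reduces the error.
    if best is None or best[0] >= sse([y for _, y in pairs]):
        return {"leaf": [y for _, y in pairs]}
    i = best[1]
    cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return {"cut": cut,
            "lo": grow(pairs[:i], min_leaf),
            "hi": grow(pairs[i:], min_leaf)}


def donors(tree, x):
    """Return the observed y values stored in the leaf that x falls into."""
    while "leaf" not in tree:
        tree = tree["lo"] if x <= tree["cut"] else tree["hi"]
    return tree["leaf"]


def synthesise(x_obs, y_obs, n, rng):
    """Sequential synthesis: x by resampling, then y given x via tree donors."""
    tree = grow(sorted(zip(x_obs, y_obs)))
    x_syn = [rng.choice(x_obs) for _ in range(n)]
    y_syn = [rng.choice(donors(tree, x)) for x in x_syn]
    return x_syn, y_syn


def slope(x, y):
    """Ordinary least squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sum((a - mx) ** 2 for a in x)


# Toy "original" data (simulated here, purely hypothetical).
rng = random.Random(1)
distance = [rng.uniform(3, 12) for _ in range(200)]
exertion = [2 + 0.5 * d + rng.gauss(0, 1) for d in distance]

syn_distance, syn_exertion = synthesise(distance, exertion, 200, rng)

# Specific utility check: does the analysis of interest give a similar answer?
print("original slope: ", round(slope(distance, exertion), 3))
print("synthetic slope:", round(slope(syn_distance, syn_exertion), 3))
```

In this sketch every synthetic value is a resampled observed value, so marginal distributions tend to look similar (a rough analogue of global utility), while the slope comparison shows whether a particular research outcome is preserved. With more variables, the same step would be repeated for each variable in turn, conditioning on those already synthesised.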
All simulation conditions demonstrated high levels of global utility. Additionally, simpler simulation conditions, which more closely resembled the analysis of the original dataset (simulation conditions 1 and 2), provided higher specific utility than more advanced simulation conditions.
In summary, three types of models can be conceptualised when generating synthetic data: (1) models used for analysis of the original data (answering specific research questions), (2) models used to generate synthetic data, and (3) models that represent the true generation process for the original data. Misalignments in the specifications of these models might introduce biases that can compromise the utility of synthetic data regardless of their purpose. As synthetic data do not constitute a direct replacement for real data from conceptual and empirical standpoints, we believe that researchers embracing this practice must include sufficient documentation concerning the purpose of the synthetic data generation process, the predictors and model used, and the potential boundary conditions for using the synthetic data in future investigations in sports and other fields.