Wang Xiangyu, Chang Shuai
Department of Physical Education, Capital Normal University, Beijing, China.
Sci Prog. 2025 Jul-Sep;108(3):368504251366850. doi: 10.1177/00368504251366850. Epub 2025 Aug 6.
Background and Objective
Machine learning models offer a practical approach for estimating body fat percentage from simple anthropometric data. However, the scarcity of biomedical data frequently leads to model overfitting, compromising predictive accuracy. Generative data augmentation presents a promising strategy to address this limitation. This study develops and evaluates a generative data augmentation framework to enhance body fat prediction from limited anthropometric data.
Methods
A public dataset comprising 249 male subjects was partitioned into development (80%) and test (20%) sets. The fidelity of synthetic data produced by a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP), random noise injection, and mixup was compared to select the optimal augmentation method. Subsequently, XGBoost, Support Vector Regression, and Multi-layer Perceptron models were trained and validated, comparing performance with and without the selected augmentation. Final model generalization was assessed on the independent test set using the coefficient of determination (R²), Mean Absolute Error, and Root Mean Squared Error.
Results
Among the evaluated augmentation techniques, WGAN-GP generated synthetic data with the highest fidelity. On the original data, the baseline XGBoost model achieved an R² of 0.67; with WGAN-GP augmentation, this increased to 0.77 on the test set. Feature importance analysis of the final model identified abdominal circumference as the most significant predictor of body fat percentage.
Conclusion
WGAN-GP is a highly effective method for generating realistic synthetic anthropometric data. Integrating these synthetic samples into the training pipeline substantially improves the generalization and predictive accuracy of machine learning models. This methodology offers a robust solution for developing more accurate and accessible predictive health models in data-scarce environments.
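The evaluation described in the Methods can be illustrated with a minimal standard-library sketch. It is not the authors' code: it shows an 80/20 partition, the two simpler augmentation baselines named in the abstract (random noise injection and mixup), and the three reported metrics. The function names, the noise scale `sigma`, the mixup parameter `alpha`, the seeds, and the rounding rule for the split are all assumptions made for illustration; the abstract does not specify them. WGAN-GP itself is omitted because it requires a deep learning framework.

```python
import math
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into (development, test) lists.

    The rounding rule is an assumption; the abstract gives only the 80/20 ratio.
    """
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def noise_injection(rows, sigma=0.01, seed=0):
    """Return a perturbed copy of rows using multiplicative Gaussian noise.

    sigma is a hypothetical relative noise scale, not a value from the study.
    """
    rng = random.Random(seed)
    return [[v * (1.0 + rng.gauss(0.0, sigma)) for v in row] for row in rows]


def mixup(rows, n_new, alpha=0.4, seed=0):
    """Create n_new rows as convex combinations of random pairs of rows.

    Each row holds the features plus the target, so both are mixed with the
    same weight. alpha is a hypothetical Beta(alpha, alpha) parameter.
    """
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        a, b = rng.sample(rows, 2)
        lam = rng.betavariate(alpha, alpha)
        out.append([lam * x + (1.0 - lam) * y for x, y in zip(a, b)])
    return out


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )
```

In a pipeline of this shape, augmentation would be applied only to the development rows, so that the test rows stay untouched and the reported metrics reflect generalization to real data.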