Traquete Francisco, Sousa Silva Marta, Ferreira António E N
FT-ICR and Structural Mass Spectrometry Laboratory, Faculdade de Ciências, Universidade de Lisboa, Portugal; Biosystems and Integrative Sciences Institute (BioISI), Faculdade de Ciências, Universidade de Lisboa, Campo Grande, 1749-016, Lisboa, Portugal.
Comput Biol Med. 2025 Jan;184:109414. doi: 10.1016/j.compbiomed.2024.109414. Epub 2024 Nov 14.
Untargeted metabolomics is a highly useful approach for discriminating between biological systems and identifying biomarkers. However, data analysis workflows are complex and face many challenges. Two of these are the need for large sample sizes and the possibility of severe class imbalance, which is particularly common in clinical studies. Class imbalance can make statistical models less generalizable, increase the risk of overfitting, and skew the analysis in favour of the majority class. One possible approach to mitigate this problem is data augmentation. However, the use of artificial data requires adequate augmentation methods and criteria for assessing the quality of the generated data. In this work, we used Conditional Wasserstein Generative Adversarial Networks with Gradient Penalty (CWGAN-GPs) to augment metabolomics data. Using a set of benchmark datasets, we applied several criteria to evaluate the quality of the generated data and assessed the performance of supervised predictive models trained on datasets that included such data. CWGAN-GP models generated realistic data with characteristics identical to those of real samples, mostly avoiding mode collapse. Furthermore, in cases of class imbalance, supplementing the minority class with generated samples improved the performance of predictive models. This was evident for high-quality datasets with well-separated classes. Conversely, model improvements were quite modest for datasets with high class overlap. This trend was confirmed using synthetic datasets with different levels of class separation. Data augmentation is a viable procedure for alleviating class imbalance problems, but it is not universally beneficial in metabolomics.
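As a rough illustration of what supplementing the minority class with generated samples can look like in practice, the following minimal Python sketch tops up every under-represented class until it matches the majority class size. It is not the authors' implementation: the function names `balance_with_generated` and `make_jitter_generator`, the `generate(label, n)` interface, and the toy data are all assumptions made for this example, and the jitter-based generator is only a placeholder for a trained conditional generator such as a CWGAN-GP.

```python
import random
from collections import Counter


def balance_with_generated(samples, labels, generate):
    """Top up each minority class with generated samples until every class
    matches the majority class size.

    `generate(label, n)` is a hypothetical callable standing in for a trained
    conditional generator; it must return n feature vectors for `label`.
    """
    counts = Counter(labels)
    target = max(counts.values())
    new_samples, new_labels = list(samples), list(labels)
    for label, count in counts.items():
        deficit = target - count
        if deficit > 0:
            new_samples.extend(generate(label, deficit))
            new_labels.extend([label] * deficit)
    return new_samples, new_labels


def make_jitter_generator(samples, labels, scale=0.05, seed=0):
    """Placeholder generator (NOT a CWGAN-GP): resamples real vectors of the
    requested class and adds small Gaussian noise to each feature."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    def generate(label, n):
        out = []
        for _ in range(n):
            base = rng.choice(by_class[label])
            out.append([v + rng.gauss(0.0, scale) for v in base])
        return out

    return generate


# Toy imbalanced data (assumed): 6 "control" vs 2 "disease" samples, 3 features.
X = [[1.0, 2.0, 3.0]] * 6 + [[4.0, 5.0, 6.0]] * 2
y = ["control"] * 6 + ["disease"] * 2
X_bal, y_bal = balance_with_generated(X, y, make_jitter_generator(X, y))
print(Counter(y_bal))
```

The real samples are kept unchanged at the front of the returned lists and only the generated ones are appended, which makes it straightforward to keep generated data out of any held-out test set when evaluating a predictive model.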