Enhancing supervised analysis of imbalanced untargeted metabolomics datasets using a CWGAN-GP framework for data augmentation.

Author information

Traquete Francisco, Sousa Silva Marta, Ferreira António E N

Affiliations

FT-ICR and Structural Mass Spectrometry Laboratory, Faculdade de Ciências, Universidade de Lisboa, Portugal; Biosystems and Integrative Sciences Institute (BioISI), Faculdade de Ciências, Universidade de Lisboa, Campo Grande, 1749-016, Lisboa, Portugal.

Publication information

Comput Biol Med. 2025 Jan;184:109414. doi: 10.1016/j.compbiomed.2024.109414. Epub 2024 Nov 14.

Abstract

Untargeted metabolomics is an extremely useful approach for the discrimination of biological systems and biomarker identification. However, data analysis workflows are complex and face many challenges. Two of these are the need for large sample sizes and the risk of severe class imbalance, which is particularly common in clinical studies. Class imbalance can make statistical models less generalizable, increase the risk of overfitting and skew the analysis in favour of the majority class. One possible approach to mitigate this problem is data augmentation. However, the use of artificial data requires adequate data augmentation methods and criteria for assessing the quality of the generated data. In this work, we used Conditional Wasserstein Generative Adversarial Networks with Gradient Penalty (CWGAN-GPs) for data augmentation of metabolomics data. Using a set of benchmark datasets, we applied several criteria to evaluate the quality of the generated data and assessed the performance of supervised predictive models trained on datasets that included such data. CWGAN-GP models generated realistic data whose characteristics matched those of real samples, mostly avoiding mode collapse. Furthermore, in cases of class imbalance, the performance of predictive models improved when the minority class was supplemented with generated samples. This improvement was evident for high-quality datasets with well-separated classes. Conversely, model improvements were quite modest for datasets with high class overlap. This trend was confirmed using synthetic datasets with different levels of class separation. Data augmentation is therefore a viable way to alleviate class imbalance problems, but it is not universally beneficial in metabolomics.
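
For readers unfamiliar with the model family named above, the following is a sketch of the training objectives in their standard conditional WGAN-GP form. The abstract does not state the exact objective used, so the notation here is an assumption taken from the general WGAN-GP literature: D is the critic, G the generator, y the class label, z the latent noise, P_r and P_g the real and generated data distributions, and \lambda the gradient-penalty weight.

```latex
% Critic loss (minimised with respect to D), conditioned on the class label y
\mathcal{L}_{D} =
  \mathbb{E}_{\tilde{x} \sim P_g}\big[ D(\tilde{x} \mid y) \big]
  - \mathbb{E}_{x \sim P_r}\big[ D(x \mid y) \big]
  + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
    \Big[ \big( \lVert \nabla_{\hat{x}} D(\hat{x} \mid y) \rVert_2 - 1 \big)^2 \Big]

% Penalty points are random interpolations between real and generated samples
\hat{x} = \epsilon \, x + (1 - \epsilon) \, \tilde{x},
\qquad \epsilon \sim U[0, 1]

% Generator loss (minimised with respect to G)
\mathcal{L}_{G} = - \mathbb{E}_{z \sim p(z)}\big[ D\big( G(z \mid y) \mid y \big) \big]
```

Conditioning on y is what allows samples to be requested for a specific class, for example the minority class in an imbalanced dataset. The gradient penalty term pushes the critic towards a gradient norm of 1 at the interpolated points, which is how this model family enforces the Lipschitz constraint required by the Wasserstein formulation.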
