Janakarajan Nikita, Graziani Mara, Rodríguez Martínez María
AI for Scientific Discovery, IBM Research Europe, Rüschlikon 8803, Switzerland.
D-INFK, ETH Zürich, Zürich 8092, Switzerland.
Bioinform Adv. 2025 May 23;5(1):vbaf124. doi: 10.1093/bioadv/vbaf124. eCollection 2025.
Machine learning has seen many successes in biomedical applications. However, supervised learning on transcriptomic data is challenging because of high dimensionality, low patient numbers, and class imbalance. Machine learning models tend to overfit these data and generalize poorly to out-of-distribution samples. Data augmentation strategies help alleviate this by introducing synthetic data points and acting as regularizers. However, existing approaches are computationally intensive, require population parametric estimates, or generate insufficiently diverse samples. To address these challenges, we introduce two classes of phenotype-driven data augmentation approaches: signature-dependent and signature-independent. The signature-dependent methods are novel, simple, and non-parametric, and assume the existence of distinct gene signatures describing some phenotype. The signature-independent methods are a modification of the established Gamma-Poisson and Poisson sampling methods for gene expression data. As case studies, we apply our augmentation methods to transcriptomic data of colorectal and breast cancer. Through discriminative and generative experiments with external validation, we show that our methods improve patient stratification over other augmentation methods in their respective cases. The study additionally provides insights into the limited benefits of over-augmenting data.
Code for reproducibility is available on GitHub.
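For readers unfamiliar with the established Gamma-Poisson sampling that the signature-independent methods build on, the following is a minimal sketch of the generic, class-conditional version. It is not the authors' modified method. The function name `gamma_poisson_augment`, its parameters, and the method-of-moments dispersion estimate are all assumptions made for illustration. Genes with no estimated overdispersion fall back to plain Poisson sampling.

```python
import numpy as np


def gamma_poisson_augment(counts, labels, n_per_class, rng=None):
    """Hypothetical helper: draw synthetic count profiles per phenotype class.

    counts: (samples, genes) array of raw expression counts.
    labels: (samples,) array of phenotype labels.
    n_per_class: number of synthetic samples to draw for each class.
    """
    rng = np.random.default_rng(rng)
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    synth_x, synth_y = [], []

    for cls in np.unique(labels):
        x = counts[labels == cls]
        mu = x.mean(axis=0)
        # With a single sample the variance is undefined; assume var == mean.
        var = x.var(axis=0, ddof=1) if x.shape[0] > 1 else mu.copy()

        # Method-of-moments dispersion (assumption): var = mu + alpha * mu^2.
        with np.errstate(divide="ignore", invalid="ignore"):
            alpha = np.where(mu > 0, (var - mu) / mu**2, 0.0)
        alpha = np.clip(alpha, 0.0, None)

        # Default rate is the class mean, which gives plain Poisson sampling.
        lam = np.tile(mu, (n_per_class, 1))

        # For overdispersed genes, draw the rate from a Gamma distribution
        # with mean mu and variance alpha * mu^2.
        over = alpha > 0
        if over.any():
            shape = 1.0 / alpha[over]
            scale = mu[over] * alpha[over]
            lam[:, over] = rng.gamma(shape, scale, size=(n_per_class, int(over.sum())))

        synth_x.append(rng.poisson(lam))
        synth_y.append(np.full(n_per_class, cls))

    return np.vstack(synth_x), np.concatenate(synth_y)


if __name__ == "__main__":
    demo_rng = np.random.default_rng(0)
    toy_counts = demo_rng.poisson(5, size=(6, 4))  # toy data, not real expression
    toy_labels = np.array([0, 0, 0, 1, 1, 1])
    new_x, new_y = gamma_poisson_augment(toy_counts, toy_labels, n_per_class=3, rng=1)
    print(new_x.shape, new_y)
```

Estimating the parameters separately for each class is what keeps the synthetic samples tied to a phenotype. With few patients per class these estimates are noisy, which illustrates the abstract's point that such methods depend on population parametric estimates.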