Karlberg Brian, Kirchgaessner Raphael, Lee Jordan, Peterkort Matthew, Beckman Liam, Goecks Jeremy, Ellrott Kyle
Biomedical Engineering, Oregon Health and Science University, 3181 S.W. Sam Jackson Park Road, Portland, OR, 97239-3098, USA.
Department of Machine Learning, Moffitt Cancer Center, Tampa, USA.
Genome Biol. 2024 Dec 18;25(1):309. doi: 10.1186/s13059-024-03431-3.
The accuracy of machine learning methods is often limited by the amount of available training data. We propose to improve machine learning training regimes by augmenting datasets with synthetically generated samples. We present a method for synthesizing gene expression samples and test its ability to improve the accuracy of categorical prediction of cancer subtypes. We developed SyntheVAEiser, a variational autoencoder-based tool that was trained and tested on over 8000 cancer samples. We show that this technique can be used to augment machine learning tasks and improve recognition of underrepresented cohorts.
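The augmentation idea in the abstract can be illustrated with a minimal sketch. This is not the SyntheVAEiser implementation. It assumes a trained decoder and per-subtype latent statistics already exist, and replaces both with hypothetical toy stand-ins (`toy_decoder`, `latent_stats`, `weights`). It shows only the general pattern of drawing latent vectors, decoding them into expression-like profiles, and adding them to underrepresented classes until the classes are balanced.

```python
import random
from collections import Counter


def sample_latent(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def toy_decoder(z, weights):
    """Hypothetical stand-in for a trained VAE decoder: a fixed linear map
    from latent space to gene-expression space (assumption, not the paper's model)."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]


def augment_to_balance(labels, latent_stats, weights, seed=0):
    """Generate synthetic (expression, label) pairs until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    synthetic = []
    for label, n in counts.items():
        mu, sigma = latent_stats[label]
        for _ in range(target - n):
            z = sample_latent(mu, sigma, rng)
            synthetic.append((toy_decoder(z, weights), label))
    return synthetic


# Hypothetical toy inputs: two subtypes, two latent dimensions, three "genes".
labels = ["subtype_A"] * 5 + ["subtype_B"] * 2
latent_stats = {
    "subtype_A": ([0.0, 0.0], [1.0, 1.0]),
    "subtype_B": ([2.0, -1.0], [0.5, 0.5]),
}
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

synthetic = augment_to_balance(labels, latent_stats, weights)
```

In this toy setup only the smaller class receives synthetic samples, which mirrors the abstract's focus on underrepresented cohorts. A real pipeline would learn the decoder and the latent statistics from expression data rather than hard-coding them.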