School of Computer Science, University of South China, Hengyang, 421001, Hunan, China.
Department of Mathematics, National University of Singapore, Singapore, 119076, Singapore.
Sci Rep. 2024 Nov 1;14(1):26321. doi: 10.1038/s41598-024-77107-0.
RNA methylation modification influences various processes in the human body and has attracted increasing attention from researchers. Predicting genes associated with RNA methylation pathways can significantly aid biologists in studying RNA methylation processes. Several prediction methods have been investigated, but their performance remains limited by the scarcity of positive samples. To address data imbalance in RNA methylation-associated gene prediction, this study employed a generative adversarial network (GAN) to learn the feature distribution of the original dataset. The quality of the synthetic samples was controlled using the Classifier Two-Sample Test (CTST). These synthetic samples were then added to the data blocks to mitigate class imbalance. Experimental results demonstrated that combining the synthetic samples generated by the proposed model with the original data enhanced the prediction performance of various classifiers and outperformed other oversampling methods. Moreover, gene ontology (GO) enrichment analyses further supported the relevance of the predicted genes to RNA methylation pathways. The PyTorch implementation of the model that generates gene samples is available at https://github.com/heyheyheyheyhey1/WGAN-GP_RNA_methylation.
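The quality-control step described above can be illustrated with a minimal sketch of a classifier two-sample test. The idea is to train a classifier to separate real from synthetic samples: accuracy close to 0.5 suggests the two sets are hard to tell apart, while accuracy close to 1.0 suggests the synthetic samples are easy to spot. The abstract does not state which classifier or acceptance threshold the authors used, so the leave-one-out 1-nearest-neighbour classifier, the `tolerance` parameter, the function names, and the toy Gaussian data below are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def loo_1nn_accuracy(real, synthetic):
    """Leave-one-out 1-nearest-neighbour accuracy for separating real (label 0)
    from synthetic (label 1) feature vectors."""
    points = [(x, 0) for x in real] + [(x, 1) for x in synthetic]
    correct = 0
    for i, (xi, yi) in enumerate(points):
        best_dist, best_label = math.inf, None
        for j, (xj, yj) in enumerate(points):
            if i == j:
                continue  # hold out the query point itself
            d = math.dist(xi, xj)
            if d < best_dist:
                best_dist, best_label = d, yj
        correct += int(best_label == yi)
    return correct / len(points)


def passes_ctst(real, synthetic, tolerance=0.1):
    """Accept the synthetic batch if classifier accuracy is within `tolerance`
    of chance level (0.5). The threshold is a hypothetical choice."""
    acc = loo_1nn_accuracy(real, synthetic)
    return abs(acc - 0.5) <= tolerance, acc


def gaussian_rows(rng, n, dim, mean):
    """Toy stand-in for gene feature vectors: n rows drawn from N(mean, 1)."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
real = gaussian_rows(rng, 100, 5, mean=0.0)
good = gaussian_rows(rng, 100, 5, mean=0.0)  # same distribution as `real`
bad = gaussian_rows(rng, 100, 5, mean=5.0)   # clearly shifted distribution

ok_good, acc_good = passes_ctst(real, good)
ok_bad, acc_bad = passes_ctst(real, bad)
print(f"matched batch: accuracy={acc_good:.2f}, accepted={ok_good}")
print(f"shifted batch: accuracy={acc_bad:.2f}, accepted={ok_bad}")
```

In a pipeline like the one described, a check of this kind would presumably be applied to each batch of generated samples before it is merged with the original data, with rejected batches discarded or regenerated.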