Gaspar Chiquitto Alisson, Oliveira Liliane Santana, Bugatti Pedro Henrique, Saito Priscila Tiemi Maeda, Basham Mark, Raittz Roberto Tadeu, Paschoal Alexandre Rossi
Department of Computer Science, Federal University of Technology of Paraná - UTFPR, Cornélio Procópio, 86300-000, Brazil.
Department of Informatics, Federal Institute of Education, Science and Technology of Mato Grosso do Sul - IFMS, Naviraí, 79947-334, Brazil.
NAR Genom Bioinform. 2025 Jun 20;7(2):lqaf072. doi: 10.1093/nargab/lqaf072. eCollection 2025 Jun.
Classifying non-coding RNA (ncRNA) sequences, particularly mirtrons, is essential for elucidating gene regulation mechanisms. However, the class imbalance prevalent in ncRNA datasets presents significant challenges, often resulting in overfitting and diminished generalization in machine learning models. This study proposes GENNUS (GENerative approaches for NUcleotide Sequences), which introduces novel data augmentation strategies based on generative adversarial networks (GANs) and the synthetic minority over-sampling technique (SMOTE) to enhance the classification of mirtrons and canonical microRNAs (miRNAs). Our GAN-based methods generate high-quality synthetic data that capture the intricate patterns and diversity of real mirtron sequences, eliminating the need for extensive feature engineering. Across four experiments, models trained on a combination of real and GAN-generated data achieve higher classification accuracy than models trained with traditional SMOTE augmentation or with real data alone. Our findings reveal that GANs enhance model performance and provide a richer representation of minority classes, thus improving generalization across various machine learning frameworks. This work highlights the transformative potential of synthetic data generation in addressing data limitations in genomics, offering a pathway toward more effective and scalable methods for classifying mirtrons and canonical miRNAs. GENNUS is available at https://github.com/chiquitto/GENNUS and https://doi.org/10.6084/m9.figshare.28207328.
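To make the SMOTE baseline mentioned in the abstract concrete, the following is a minimal sketch of SMOTE-style oversampling for a minority class. It is not the GENNUS implementation. It assumes sequences are first encoded as normalized dinucleotide frequencies (an illustrative choice; the abstract does not specify the feature encoding), and the function names `kmer_frequencies` and `smote_oversample`, the parameters, and the toy sequences are all hypothetical.

```python
from itertools import product

import numpy as np


def kmer_frequencies(seq, k=2):
    """Encode an RNA sequence as normalized k-mer counts over A, C, G, U."""
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # k-mers with other symbols are skipped
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def smote_oversample(X_min, n_new, k_neighbors=2, seed=0):
    """Create n_new synthetic minority samples by interpolating between
    a random minority sample and one of its nearest minority neighbors."""
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k_neighbors, n - 1)
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances; a sample is never its own neighbor.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(n)                 # random minority sample
        j = neighbors[i, rng.integers(k)]   # one of its k nearest neighbors
        gap = rng.random()                  # interpolation factor in [0, 1)
        synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic


# Made-up toy sequences for illustration only (not real mirtrons).
minority = ["GUGAGUGCAUGCUAGCAG", "GUAAGUCCGAUUGCACAG", "GUGGGUAUCCGAUUACAG"]
X_min = np.array([kmer_frequencies(s) for s in minority])
X_syn = smote_oversample(X_min, n_new=5)
print(X_syn.shape)
```

Because each synthetic point lies on a line segment between two real minority samples, SMOTE can only fill in the space the existing samples already span. That limitation is one reason to compare it against generative models such as GANs, as the study does.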