Abbasi-Vineh Mohammad Ali, Rouzbahani Shirin, Kavousi Kaveh, Emadpour Masoumeh
Department of Agricultural Biotechnology, Tarbiat Modares University (TMU), Tehran, 1497713111, Iran.
Department of Bioinformatics, Laboratory of Complex Biological Systems and Bioinformatics (CBB), Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran.
Sci Rep. 2025 Jul 25;15(1):27079. doi: 10.1038/s41598-025-12796-9.
One key barrier to applying deep learning (DL) to omics and other biological datasets is data scarcity, particularly when each gene or protein is represented by a single sequence. This challenge is especially relevant in research involving genetically constrained organisms, organelles, specialized cell types, and biological cycles and pathways. This study introduces a novel data augmentation strategy designed to facilitate the application of DL models to omics datasets. Using a sliding window technique, the approach generates a large number of subsequences with controlled overlaps and shared nucleotide features. A hybrid model combining Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers was applied to augmented datasets comprising genes and proteins from the chloroplasts of eight microalgae and higher plants. The data augmentation strategy made it possible to apply DL methods to these datasets and significantly improved model performance by avoiding common issues such as overfitting and non-representative sequence variations. The augmentation process is highly adaptable and can be applied across different types of biological data repositories. Furthermore, a complementary k-mer-based data augmentation strategy was introduced for unlabeled datasets, enhancing unsupervised analysis. Overall, these strategies provide robust solutions for training models on datasets with limited data availability.
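The sliding window and k-mer ideas described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`sliding_windows`, `augment_labeled`, `kmer_counts`), the window length, the step size, and the toy sequences are all assumptions chosen for illustration, since the abstract does not state the parameters used. In this sketch the overlap between consecutive subsequences is simply `window - step`.

```python
from collections import Counter


def sliding_windows(seq, window, step):
    """Return subsequences of length `window`, starting every `step` positions.

    Consecutive subsequences overlap by `window - step` characters.
    Sequences shorter than `window` yield an empty list.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, step)]


def augment_labeled(records, window, step):
    """Expand (sequence, label) pairs so each subsequence inherits its parent's label."""
    return [
        (sub, label)
        for seq, label in records
        for sub in sliding_windows(seq, window, step)
    ]


def kmer_counts(seq, k):
    """Count every k-mer in `seq` (one possible feature view for unlabeled data)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy example with hypothetical sequences and labels.
records = [("ACGTACGTAC", "geneA"), ("TTGACCA", "geneB")]
augmented = augment_labeled(records, window=4, step=2)
```

With one sequence per gene, a smaller `step` produces more, more heavily overlapping subsequences per gene, so the choice trades dataset size against redundancy between training examples.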