Innovative data augmentation strategy for deep learning on biological datasets with limited gene representations focused on chloroplast genomes.

Author information

Abbasi-Vineh Mohammad Ali, Rouzbahani Shirin, Kavousi Kaveh, Emadpour Masoumeh

Affiliations

Department of Agricultural Biotechnology, Tarbiat Modares University (TMU), Tehran, 1497713111, Iran.

Department of Bioinformatics, Laboratory of Complex Biological Systems and Bioinformatics (CBB), Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran.

Publication information

Sci Rep. 2025 Jul 25;15(1):27079. doi: 10.1038/s41598-025-12796-9.

DOI:10.1038/s41598-025-12796-9

PMID:40715495

Abstract

One key barrier to applying deep learning (DL) to omics and other biological datasets is data scarcity, particularly when each gene or protein is represented by a single sequence. This fundamental challenge is mainly relevant in research involving genetically constrained organisms, organelles, specialized cell types, and biological cycles and pathways. This study introduces a novel data augmentation strategy designed to facilitate the application of DL models to omics datasets. This approach generated a high number of overlapping subsequences with controlled overlaps and shared nucleotide features through a sliding window technique. A hybrid model of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers was applied across augmented datasets comprising genes and proteins from eight microalgae and higher plant chloroplasts. The data augmentation strategy enabled employing DL methods on these datasets and significantly improved the model performance by avoiding common issues such as overfitting and non-representative sequence variations. The current augmentation process is highly adaptable, providing flexibility across different types of biological data repositories. Furthermore, a complementary k-mer-based data augmentation strategy was introduced for unlabeled datasets, enhancing unsupervised analysis. Overall, these innovative strategies provide robust solutions for optimizing model training potential in the study of datasets with limited data availability.
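The sliding-window augmentation and the k-mer-based strategy described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the window length, the overlap, the value of k, or how labels are assigned, so the function names (`sliding_windows`, `augment_labeled`, `kmer_counts`), the parameter values, the toy sequences, and the choice to let each subsequence inherit its parent gene's label are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def sliding_windows(seq, window, step):
    """Return every subsequence of length `window`, advancing by `step`.

    Consecutive subsequences overlap by `window - step` characters, so the
    overlap is controlled by the choice of `step` (hypothetical parameters).
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, step)]


def augment_labeled(records, window, step):
    """Expand (label, sequence) pairs into (label, subsequence) pairs.

    Assumption: each subsequence inherits the label of its parent gene.
    """
    return [
        (label, sub)
        for label, seq in records
        for sub in sliding_windows(seq, window, step)
    ]


def kmer_counts(seq, k):
    """Count overlapping k-mers in a sequence (for unlabeled data)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    # Toy sequences, not real chloroplast genes.
    records = [("geneA", "ATGCGTACGTTAGC"), ("geneB", "ATGATGCCGTAA")]
    for label, sub in augment_labeled(records, window=8, step=2):
        print(label, sub)
    print(kmer_counts("ATGATG", k=3))
```

With `window=8` and `step=2`, a single 14-nucleotide sequence yields four subsequences that each share six nucleotides with their neighbor. A smaller step produces more subsequences with more overlap. Because overlapping subsequences from the same gene are highly correlated, one cautious design choice would be to split training and test sets by parent sequence rather than by subsequence.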

