Innovative data augmentation strategy for deep learning on biological datasets with limited gene representations focused on chloroplast genomes.

Authors

Abbasi-Vineh Mohammad Ali, Rouzbahani Shirin, Kavousi Kaveh, Emadpour Masoumeh

Affiliations

Department of Agricultural Biotechnology, Tarbiat Modares University (TMU), Tehran, 1497713111, Iran.

Department of Bioinformatics, Laboratory of Complex Biological Systems and Bioinformatics (CBB), Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran.

Publication information

Sci Rep. 2025 Jul 25;15(1):27079. doi: 10.1038/s41598-025-12796-9.

DOI: 10.1038/s41598-025-12796-9
PMID: 40715495

Abstract

One key barrier to applying deep learning (DL) to omics and other biological datasets is data scarcity, particularly when each gene or protein is represented by a single sequence. This fundamental challenge is mainly relevant in research involving genetically constrained organisms, organelles, specialized cell types, and biological cycles and pathways. This study introduces a novel data augmentation strategy designed to facilitate the application of DL models to omics datasets. This approach generated a high number of overlapping subsequences with controlled overlaps and shared nucleotide features through a sliding window technique. A hybrid model of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers was applied across augmented datasets comprising genes and proteins from eight microalgae and higher plant chloroplasts. The data augmentation strategy enabled employing DL methods on these datasets and significantly improved the model performance by avoiding common issues such as overfitting and non-representative sequence variations. The current augmentation process is highly adaptable, providing flexibility across different types of biological data repositories. Furthermore, a complementary k-mer-based data augmentation strategy was introduced for unlabeled datasets, enhancing unsupervised analysis. Overall, these innovative strategies provide robust solutions for optimizing model training potential in the study of datasets with limited data availability.
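
The sliding-window idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `window` and `overlap` parameters, and the toy sequence are assumptions chosen for illustration, and the abstract does not state the window lengths or overlaps the study actually used. The k-mer helper shows only generic k-mer decomposition, which is an assumed building block rather than the paper's specific strategy for unlabeled data.

```python
from collections import Counter
from typing import Dict, List, Tuple


def sliding_window_subsequences(seq: str, window: int, overlap: int) -> List[str]:
    """Split one sequence into fixed-length subsequences that overlap.

    `window` and `overlap` are hypothetical parameters; consecutive
    windows share exactly `overlap` characters.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    # Only full-length windows are kept; a trailing partial window is dropped.
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, step)]


def augment_labeled(records: Dict[str, str], window: int, overlap: int) -> List[Tuple[str, str]]:
    """Turn {label: sequence} into (subsequence, label) training pairs.

    Each subsequence inherits the label of the gene it came from.
    """
    pairs = []
    for label, seq in records.items():
        for sub in sliding_window_subsequences(seq, window, overlap):
            pairs.append((sub, label))
    return pairs


def kmer_counts(seq: str, k: int) -> Dict[str, int]:
    """Count every overlapping k-mer in a sequence (generic decomposition)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return dict(Counter(seq[i:i + k] for i in range(len(seq) - k + 1)))


if __name__ == "__main__":
    toy = {"geneA": "ACGTACGTAC"}  # toy sequence, not real chloroplast data
    print(augment_labeled(toy, window=4, overlap=2))
    print(kmer_counts(toy["geneA"], k=3))
```

With a window of 4 and an overlap of 2, the 10-nucleotide toy sequence yields four labeled subsequences, which shows how a single sequence per gene can be expanded into several training examples. In practice the window and overlap would have to be tuned to the gene lengths in the dataset.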
