Improving the performance of supervised deep learning for regulatory genomics using phylogenetic augmentation.

Authors

Duncan Andrew G, Mitchell Jennifer A, Moses Alan M

Affiliation

Cell & Systems Biology, University of Toronto, Toronto, ON M5S 3G5, Canada.

Publication

Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae190.

DOI:10.1093/bioinformatics/btae190
PMID:38588559
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11042905/
Abstract

MOTIVATION

Supervised deep learning is used to model the complex relationship between genomic sequence and regulatory function. Understanding how these models make predictions can provide biological insight into regulatory functions. Given the complexity of the mapping from sequence to regulatory function (the cis-regulatory code), it has been suggested that the genome contains insufficient sequence variation to train models of suitable complexity. Data augmentation is a widely used approach to increase the data variation available for model training; however, current data augmentation methods for genomic sequence data are limited.

RESULTS

Inspired by the success of comparative genomics, we show that augmenting genomic sequences with evolutionarily related sequences from other species, which we term phylogenetic augmentation, improves the performance of deep learning models trained on regulatory genomic sequences to predict high-throughput functional assay measurements. Additionally, we show that phylogenetic augmentation can rescue model performance when the training set is down-sampled and permits deep learning on a real-world small dataset, demonstrating that this approach improves data efficiency. Overall, this data augmentation method represents a solution for improving model performance that is applicable to many supervised deep-learning problems in genomics.
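
The idea of phylogenetic augmentation described above can be illustrated with a minimal sketch. It is not the authors' implementation (their code is in the repository named under AVAILABILITY AND IMPLEMENTATION). It assumes that each training sequence has a list of evolutionarily related sequences from other species, that a related sequence inherits the assay label of the original, and that all sequences are already the same length. The names `phylo_augment`, `training_batches`, `homologs`, and `p_augment` are hypothetical.

```python
import random

# One-hot column order for DNA bases; any other symbol (e.g. N) becomes all zeros.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def phylo_augment(seq_id, sequences, homologs, rng, p_augment=1.0):
    """Return the original sequence or one randomly chosen related sequence.

    sequences: dict of seq_id -> sequence from the species that was assayed
    homologs:  dict of seq_id -> list of related sequences from other species
               (hypothetical layout; how they are obtained is out of scope here)
    p_augment: probability of swapping in a related sequence (assumed knob)
    """
    candidates = homologs.get(seq_id, [])
    if candidates and rng.random() < p_augment:
        return rng.choice(candidates)
    # No related sequence available, or augmentation skipped for this draw.
    return sequences[seq_id]


def training_batches(labels, sequences, homologs, batch_size, rng):
    """Yield (ids, encoded sequences, labels) for one epoch.

    Assumption: the label measured for the original sequence is reused
    unchanged for whichever related sequence is drawn.
    """
    ids = list(labels)
    rng.shuffle(ids)
    for start in range(0, len(ids), batch_size):
        batch_ids = ids[start:start + batch_size]
        x = [one_hot(phylo_augment(i, sequences, homologs, rng)) for i in batch_ids]
        y = [labels[i] for i in batch_ids]
        yield batch_ids, x, y


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data, invented for illustration only.
    sequences = {"enh1": "ACGT", "enh2": "TTGA"}
    homologs = {"enh1": ["ACGA", "ACCT"]}
    labels = {"enh1": 1.5, "enh2": -0.3}
    for batch_ids, x, y in training_batches(labels, sequences, homologs, 2, rng):
        print(batch_ids, y)
```

Because a fresh draw is made every epoch, the model sees a different mix of species for the same label over the course of training. In this sketch, validation and test data would bypass `phylo_augment` and use only the original sequences, which is an assumption rather than something stated in the abstract.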

AVAILABILITY AND IMPLEMENTATION

The open-source GitHub repository agduncan94/phylogenetic_augmentation_paper includes the code for rerunning the analyses here and recreating the figures.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/56c9/11042905/973f3732148b/btae190f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/56c9/11042905/a75521f2bc61/btae190f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/56c9/11042905/56087d82a765/btae190f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/56c9/11042905/1986857d5af7/btae190f3.jpg
