Strategies for effectively modelling promoter-driven gene expression using transfer learning.

Authors

Reddy Aniketh Janardhan, Herschl Michael H, Geng Xinyang, Kolli Sathvik, Lu Amy X, Kumar Aviral, Hsu Patrick D, Levine Sergey, Ioannidis Nilah M

Affiliation

University of California, Berkeley.

Publication information

bioRxiv. 2024 May 19:2023.02.24.529941. doi: 10.1101/2023.02.24.529941.

DOI:10.1101/2023.02.24.529941

PMID:36909524

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10002662/

Abstract

The ability to deliver genetic cargo to human cells is enabling rapid progress in molecular medicine, but designing this cargo for precise expression in specific cell types is a major challenge. Expression is driven by regulatory DNA sequences within short synthetic promoters, but relatively few of these promoters are cell-type-specific. The ability to design cell-type-specific promoters using model-based optimization would be impactful for research and therapeutic applications. However, models of expression from short synthetic promoters (promoter-driven expression) are lacking for most cell types due to insufficient training data in those cell types. Although there are many large datasets of both endogenous expression and promoter-driven expression in other cell types, which provide information that could be used for transfer learning, transfer strategies remain largely unexplored for predicting promoter-driven expression. Here, we propose a variety of pretraining tasks, transfer strategies, and model architectures for modelling promoter-driven expression. To thoroughly evaluate various methods, we propose two benchmarks that reflect data-constrained and large dataset settings. In the data-constrained setting, we find that pretraining followed by transfer learning is highly effective, improving performance by 24-27%. In the large dataset setting, transfer learning leads to more modest gains, improving performance by up to 2%. We also propose the best architecture to model promoter-driven expression when training from scratch. The methods we identify are broadly applicable for modelling promoter-driven expression in understudied cell types, and our findings will guide the choice of models that are best suited to designing promoters for gene delivery applications using model-based optimization. Our code and data are available at https://github.com/anikethjr/promoter_models.

Similar articles

Designing Cell-Type-Specific Promoter Sequences Using Conservative Model-Based Optimization.

bioRxiv. 2024 Jun 23:2024.06.23.600232. doi: 10.1101/2024.06.23.600232.

DeepPHiC: predicting promoter-centered chromatin interactions using a novel deep learning approach.

Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btac801.

Combining diffusion and transformer models for enhanced promoter synthesis and strength prediction in deep learning.

mSystems. 2025 Apr 22;10(4):e0018325. doi: 10.1128/msystems.00183-25. Epub 2025 Mar 19.

iPromoter-Seqvec: identifying promoters using bidirectional long short-term memory and sequence-embedded features.

BMC Genomics. 2022 Oct 3;23(Suppl 5):681. doi: 10.1186/s12864-022-08829-6.

EVMP: enhancing machine learning models for synthetic promoter strength prediction by Extended Vision Mutant Priority framework.

Front Microbiol. 2023 Jul 5;14:1215609. doi: 10.3389/fmicb.2023.1215609. eCollection 2023.

PromoterPredict: sequence-based modelling of σ promoter strength yields logarithmic dependence between promoter strength and sequence.

PeerJ. 2018 Nov 7;6:e5862. doi: 10.7717/peerj.5862. eCollection 2018.

ChampKit: A framework for rapid evaluation of deep neural networks for patch-based histopathology classification.

Comput Methods Programs Biomed. 2023 Sep;239:107631. doi: 10.1016/j.cmpb.2023.107631. Epub 2023 May 30.

Transfer learning for drug-target interaction prediction.

Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i103-i110. doi: 10.1093/bioinformatics/btad234.

Model-driven generation of artificial yeast promoters.

Nat Commun. 2020 Apr 30;11(1):2113. doi: 10.1038/s41467-020-15977-4.

Cited by

DNA sequence analysis landscape: a comprehensive review of DNA sequence analysis task types, databases, datasets, word embedding methods, and language models.

Front Med (Lausanne). 2025 Apr 8;12:1503229. doi: 10.3389/fmed.2025.1503229. eCollection 2025.
