Strategies for effectively modelling promoter-driven gene expression using transfer learning.

Author information

Reddy Aniketh Janardhan, Herschl Michael H, Geng Xinyang, Kolli Sathvik, Lu Amy X, Kumar Aviral, Hsu Patrick D, Levine Sergey, Ioannidis Nilah M

Affiliations

University of California, Berkeley.

Publication information

bioRxiv. 2024 May 19:2023.02.24.529941. doi: 10.1101/2023.02.24.529941.

Abstract

The ability to deliver genetic cargo to human cells is enabling rapid progress in molecular medicine, but designing this cargo for precise expression in specific cell types is a major challenge. Expression is driven by regulatory DNA sequences within short synthetic promoters, but relatively few of these promoters are cell-type-specific. The ability to design cell-type-specific promoters using model-based optimization would be impactful for research and therapeutic applications. However, models of expression from short synthetic promoters (promoter-driven expression) are lacking for most cell types due to insufficient training data in those cell types. Although there are many large datasets of both endogenous expression and promoter-driven expression in other cell types, which provide information that could be used for transfer learning, transfer strategies remain largely unexplored for predicting promoter-driven expression. Here, we propose a variety of pretraining tasks, transfer strategies, and model architectures for modelling promoter-driven expression. To thoroughly evaluate various methods, we propose two benchmarks that reflect data-constrained and large dataset settings. In the data-constrained setting, we find that pretraining followed by transfer learning is highly effective, improving performance by 24-27%. In the large dataset setting, transfer learning leads to more modest gains, improving performance by up to 2%. We also propose the best architecture to model promoter-driven expression when training from scratch. The methods we identify are broadly applicable for modelling promoter-driven expression in understudied cell types, and our findings will guide the choice of models that are best suited to designing promoters for gene delivery applications using model-based optimization. Our code and data are available at https://github.com/anikethjr/promoter_models.
