Reddy Aniketh Janardhan, Herschl Michael H, Geng Xinyang, Kolli Sathvik, Lu Amy X, Kumar Aviral, Hsu Patrick D, Levine Sergey, Ioannidis Nilah M
University of California, Berkeley.
bioRxiv. 2024 May 19:2023.02.24.529941. doi: 10.1101/2023.02.24.529941.
The ability to deliver genetic cargo to human cells is enabling rapid progress in molecular medicine, but designing this cargo for precise expression in specific cell types is a major challenge. Expression is driven by regulatory DNA sequences within short synthetic promoters, but relatively few of these promoters are cell-type-specific. The ability to design cell-type-specific promoters using model-based optimization would be impactful for research and therapeutic applications. However, models of expression from short synthetic promoters (promoter-driven expression) are lacking for most cell types due to insufficient training data in those cell types. Although there are many large datasets of both endogenous expression and promoter-driven expression in other cell types, which provide information that could be used for transfer learning, transfer strategies remain largely unexplored for predicting promoter-driven expression. Here, we propose a variety of pretraining tasks, transfer strategies, and model architectures for modelling promoter-driven expression. To thoroughly evaluate various methods, we propose two benchmarks that reflect data-constrained and large dataset settings. In the data-constrained setting, we find that pretraining followed by transfer learning is highly effective, improving performance by 24-27%. In the large dataset setting, transfer learning leads to more modest gains, improving performance by up to 2%. We also propose the best architecture to model promoter-driven expression when training from scratch. The methods we identify are broadly applicable for modelling promoter-driven expression in understudied cell types, and our findings will guide the choice of models that are best suited to designing promoters for gene delivery applications using model-based optimization. Our code and data are available at https://github.com/anikethjr/promoter_models.